{"title": "Quaternion Knowledge Graph Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 2735, "page_last": 2745, "abstract": "In this work, we move beyond the traditional complex-valued representations, introducing more expressive hypercomplex representations to model entities and relations for knowledge graph embeddings. More specifically, quaternion embeddings, hypercomplex-valued embeddings with three imaginary components, are utilized to represent entities. Relations are modelled as rotations in the quaternion space. The advantages of the proposed approach are: (1) Latent inter-dependencies (between all components) are aptly captured with Hamilton product, encouraging a more compact interaction between entities and relations; (2) Quaternions enable expressive rotation in four-dimensional space and have more degree of freedom than rotation in complex plane; (3) The proposed framework is a generalization of ComplEx on hypercomplex space while offering better geometrical interpretations, concurrently satisfying the key desiderata of relational representation learning (i.e., modeling symmetry, anti-symmetry and inversion). Experimental results demonstrate that our method achieves state-of-the-art performance on four well-established knowledge graph completion benchmarks.", "full_text": "Quaternion Knowledge Graph Embeddings\n\nShuai Zhang\u2020\u2217, Yi Tay\u03c8\u2217, Lina Yao\u2020, Qi Liu\u03c6\n\n\u2020 University of New South Wales\n\n\u03c8Nanyang Technological University, \u03c6University of Oxford\n\nAbstract\n\nIn this work, we move beyond the traditional complex-valued representations,\nintroducing more expressive hypercomplex representations to model entities and\nrelations for knowledge graph embeddings. More speci\ufb01cally, quaternion embed-\ndings, hypercomplex-valued embeddings with three imaginary components, are\nutilized to represent entities. Relations are modelled as rotations in the quaternion\nspace. 
The advantages of the proposed approach are: (1) Latent inter-dependencies (between all components) are aptly captured with the Hamilton product, encouraging a more compact interaction between entities and relations; (2) Quaternions enable expressive rotation in four-dimensional space and have more degrees of freedom than rotation in the complex plane; (3) The proposed framework is a generalization of ComplEx on hypercomplex space while offering better geometrical interpretations, concurrently satisfying the key desiderata of relational representation learning (i.e., modeling symmetry, anti-symmetry and inversion). Experimental results demonstrate that our method achieves state-of-the-art performance on four well-established knowledge graph completion benchmarks.

1 Introduction

Knowledge graphs (KGs) live at the heart of many semantic applications (e.g., question answering, search, and natural language processing). KGs enable not only powerful relational reasoning but also the ability to learn structural representations. Reasoning with KGs has been an extremely productive research direction, with many innovations leading to improvements in many downstream applications. However, real-world KGs are usually incomplete. As such, completing KGs and predicting missing links between entities have gained growing interest. Learning low-dimensional representations of entities and relations for KGs is an effective solution to this task.

Learning KG embeddings in the complex space C has been proven to be a highly effective inductive bias, largely owing to its intrinsic asymmetrical properties. This is demonstrated by the ComplEx embedding method, which infers new relational triplets with the asymmetrical Hermitian product.

In this paper, we move beyond complex representations, exploring hypercomplex space for learning KG embeddings. More concretely, quaternion embeddings are utilized to represent entities and relations.
Each quaternion embedding is a vector in the hypercomplex space H with three imaginary components i, j, k, as opposed to the standard complex space C with one real component and a single imaginary component i. We propose a new scoring function, where the head entity Qh is rotated by the relational quaternion embedding through the Hamilton product. This is followed by a quaternion inner product with the tail entity Qt.

There are numerous benefits to this formulation. (1) The Hamilton operator provides a greater extent of expressiveness compared to the complex Hermitian operator and the inner product in Euclidean space. The Hamilton operator forges inter-latent interactions between all of r, i, j, k, resulting in a highly expressive model. (2) Quaternion representations are highly desirable for parameterizing smooth rotations and spatial transformations in vector space. They are generally considered robust to shear/scaling noise and perturbations (i.e., numerically stable rotations) and avoid the problem of gimbal lock. Moreover, quaternion rotations have two planes of rotation2 while complex rotations only work on a single plane, giving the model more degrees of freedom. (3) Our QuatE framework subsumes the ComplEx method, concurrently inheriting its attractive properties such as its ability to model symmetry, anti-symmetry, and inversion. (4) Our model uses equal or even fewer parameters, while outperforming previous work.

∗Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Experimental results demonstrate that our method achieves state-of-the-art performance on four well-established knowledge graph completion benchmarks (WN18, FB15K, WN18RR, and FB15K-237).
We roughly divide previous work into translational models and semantic matching models based on the scoring function, i.e., the composition over head & tail entities and relations.

Translational methods, popularized by TransE [Bordes et al., 2013], are widely used embedding methods which interpret relation vectors as translations in vector space, i.e., head + relation ≈ tail. A number of models aiming to improve TransE were proposed subsequently. TransH [Wang et al., 2014] introduces relation-specific hyperplanes with a normal vector. TransR [Lin et al., 2015] further introduces relation-specific spaces by modelling entities and relations in distinct spaces with a shared projection matrix. TransD [Ji et al., 2015] uses independent projection vectors for each entity and relation and can reduce the amount of computation compared to TransR. TorusE [Ebisu and Ichise, 2018] defines embeddings and a distance function on a compact Lie group, the torus, and shows better accuracy and scalability. The recent state-of-the-art, RotatE [Sun et al., 2019], proposes a rotation-based translational method with complex-valued embeddings.

On the other hand, semantic matching models include bilinear models such as RESCAL [Nickel et al., 2011], DistMult [Yang et al., 2014], HolE [Nickel et al., 2016], and ComplEx [Trouillon et al., 2016], as well as neural-network-based models. These methods measure plausibility by matching the latent semantics of entities and relations. In RESCAL, each relation is represented with a square matrix, while DistMult replaces it with a diagonal matrix in order to reduce the complexity. SimplE [Kazemi and Poole, 2018] is also a simple yet effective bilinear approach for knowledge graph embedding. HolE explores holographic reduced representations and makes use of circular correlation to capture rich interactions between entities.
ComplEx embeds entities and relations in complex space and utilizes the Hermitian product to model antisymmetric patterns, which has been shown to be immensely helpful in learning KG representations. The scoring function of ComplEx is isomorphic to that of HolE [Trouillon and Nickel, 2017]. Neural-network-based methods have also been adopted; e.g., the Neural Tensor Network [Socher et al., 2013] and ER-MLP [Dong et al., 2014] are two representative neural-network-based methodologies. More recently, convolutional neural networks [Dettmers et al., 2018], graph convolutional networks [Schlichtkrull et al., 2018], and deep memory networks [Wang et al., 2018] have also shown promising performance on this task.

Different from previous work, QuatE takes advantage of quaternion representations (e.g., their geometrical meaning and rich representation capability) to enable rich and expressive semantic matching between head and tail entities, assisted by relational rotation quaternions. Our framework subsumes DistMult and ComplEx, with the capability to generalize to more advanced hypercomplex spaces. QuatE utilizes the concept of geometric rotation. Unlike RotatE, which has only one plane of rotation, there are two planes of rotation in QuatE. QuatE is a semantic matching model while RotatE is a translational model. We also point out that the composition property introduced in TransE and RotatE can have detrimental effects on the KG embedding task.

The quaternion is a hypercomplex number system first described by Hamilton [Hamilton, 1844], with applications in a wide variety of areas including astronautics, robotics, computer visualisation, animation and special effects in movies, and navigation. Lately, quaternions have attracted attention in the field of machine learning.
Quaternion recurrent neural networks (QRNNs) obtain better performance with fewer free parameters than traditional RNNs on the phoneme recognition task. Quaternion representations are also useful for enhancing the performance of convolutional neural networks on multiple tasks such as automatic speech recognition [Parcollet et al.] and image classification [Gaudet and Maida, 2018, Parcollet et al., 2018a]. Quaternion multilayer perceptrons [Parcollet et al., 2016] and quaternion autoencoders [Parcollet et al., 2017] also outperform the standard MLP and autoencoder. In a nutshell, the major motivation behind these models is that quaternions enable neural networks to code latent inter- and intra-dependencies between multidimensional input features, thus leading to more compact interactions and better representation capability.

2A plane of rotation is an abstract object used to describe or visualize rotations in space.

3 Hamilton's Quaternions

The quaternion [Hamilton, 1844] is a representative hypercomplex number system, extending the traditional complex number system to four-dimensional space. A quaternion Q consists of one real component and three imaginary components, defined as Q = a + bi + cj + dk, where a, b, c, d are real numbers and i, j, k are imaginary units. i, j and k are square roots of −1, satisfying Hamilton's rules: i^2 = j^2 = k^2 = ijk = −1. More useful relations can be derived from these rules, such as ij = k, ji = −k, jk = i, ki = j, kj = −i and ik = −j. Figure 1(b) shows the product of the quaternion imaginary units. Evidently, the multiplication between imaginary units is non-commutative.
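As a quick illustration (not part of the paper's code), Hamilton's rules pin down the whole multiplication table, which can be encoded and checked in a few lines of Python with quaternions held as plain 4-tuples:

```python
# Minimal quaternion as a 4-tuple (a, b, c, d) representing a + bi + cj + dk.
# The product below is derived directly from Hamilton's rules
# i^2 = j^2 = k^2 = ijk = -1 (and the identities ij = k, ji = -k, ...).

def hamilton(q1, q2):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (
        a1*a2 - b1*b2 - c1*c2 - d1*d2,   # real part
        a1*b2 + b1*a2 + c1*d2 - d1*c2,   # i coefficient
        a1*c2 - b1*d2 + c1*a2 + d1*b2,   # j coefficient
        a1*d2 + b1*c2 - c1*b2 + d1*a2,   # k coefficient
    )

I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

# i^2 = -1, ij = k but ji = -k, and ijk = -1: non-commutative, as stated.
assert hamilton(I, I) == (-1, 0, 0, 0)
assert hamilton(I, J) == K
assert hamilton(J, I) == (0, 0, 0, -1)          # -k
assert hamilton(hamilton(I, J), K) == (-1, 0, 0, 0)
```

The same product, applied coordinate-wise to embedding vectors, is the operation QuatE builds on.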
Some widely used operations of the quaternion algebra H are introduced as follows:

Conjugate: The conjugate of a quaternion Q is defined as Q̄ = a − bi − cj − dk.

Norm: The norm of a quaternion is defined as |Q| = √(a^2 + b^2 + c^2 + d^2).

Inner Product: The quaternion inner product between Q1 = a1 + b1i + c1j + d1k and Q2 = a2 + b2i + c2j + d2k is obtained by taking the inner products between the corresponding scalar and imaginary components and summing up the four inner products:

Q1 · Q2 = ⟨a1, a2⟩ + ⟨b1, b2⟩ + ⟨c1, c2⟩ + ⟨d1, d2⟩   (1)

Hamilton Product (Quaternion Multiplication): The Hamilton product is composed of all the standard multiplications of factors in quaternions and follows the distributive law, defined as:

Q1 ⊗ Q2 = (a1a2 − b1b2 − c1c2 − d1d2) + (a1b2 + b1a2 + c1d2 − d1c2)i + (a1c2 − b1d2 + c1a2 + d1b2)j + (a1d2 + b1c2 − c1b2 + d1a2)k,   (2)

which determines another quaternion. The Hamilton product is not commutative. Spatial rotations can be modelled with the quaternion Hamilton product. Multiplying a quaternion Q1 by another quaternion Q2 has the effect of scaling Q1 by the magnitude of Q2, followed by a special type of rotation in four dimensions. As such, we can also rewrite the above equation as:

Q1 ⊗ Q2 = Q1 ⊗ |Q2| (Q2 / |Q2|)   (3)

4 Method

4.1 Quaternion Representations for Knowledge Graph Embeddings

Suppose that we have a knowledge graph G consisting of N entities and M relations. E and R denote the sets of entities and relations, respectively. The training set consists of triplets (h, r, t), where h, t ∈ E and r ∈ R. We use Ω and Ω′ = E × R × E − Ω to denote the set of observed triplets and the set of unobserved triplets, respectively.
Yhrt ∈ {−1, 1} represents the corresponding label of the triplet (h, r, t). The goal of knowledge graph embedding is to embed entities and relations into a continuous low-dimensional space, while preserving graph relations and semantics.

In this paper, we propose learning effective representations for entities and relations with quaternions. We leverage the expressive rotational capability of quaternions. Unlike RotatE, which has only one plane of rotation (i.e., the complex plane, shown in Figure 1(a)), QuatE has two planes of rotation. Compared to Euler angles, quaternions can avoid the problem of gimbal lock (loss of one degree of freedom). Quaternions are also more efficient and numerically stable than rotation matrices. The proposed method can be summarized in two steps: (1) rotate the head quaternion using the unit relation quaternion; (2) take the quaternion inner product between the rotated head quaternion and the tail quaternion to score each triplet. If a triplet exists in the KG, the model will rotate the head entity with the relation to make the angle between the head and tail entities smaller, so that the product is maximized. Otherwise, we can make the head and tail entities orthogonal so that their product becomes zero.

Figure 1: (a) Complex plane; (b) Quaternion units product; (c) stereographically projected hypersphere in 3D space. The purple dot indicates the position of the unit quaternion.

Quaternion Embeddings of Knowledge Graphs More specifically, we use a quaternion matrix Q ∈ H^{N×k} to denote the entity embeddings and W ∈ H^{M×k} to denote the relation embeddings, where k is the dimension of the embeddings.
Given a triplet (h, r, t), the head entity h and the tail entity t correspond to Qh = {ah + bhi + chj + dhk : ah, bh, ch, dh ∈ R^k} and Qt = {at + bti + ctj + dtk : at, bt, ct, dt ∈ R^k}, respectively, while the relation r is represented by Wr = {ar + bri + crj + drk : ar, br, cr, dr ∈ R^k}.

Hamilton-Product-Based Relational Rotation We first normalize the relation quaternion Wr to a unit quaternion W◁r = p + qi + uj + vk, eliminating the scaling effect by dividing Wr by its norm:

W◁r(p, q, u, v) = Wr / |Wr| = (ar + bri + crj + drk) / √(ar^2 + br^2 + cr^2 + dr^2)   (4)

We visualize a unit quaternion in Figure 1(c) by projecting it into 3D space. We keep the unit hypersphere which passes through i, j, k in place. The unit quaternion can be projected in, on, or outside the unit hypersphere depending on the value of the real part.

Secondly, we rotate the head entity Qh by taking the Hamilton product between it and W◁r:

Q′h(a′h, b′h, c′h, d′h) = Qh ⊗ W◁r
= (ah ◦ p − bh ◦ q − ch ◦ u − dh ◦ v)
+ (ah ◦ q + bh ◦ p + ch ◦ v − dh ◦ u)i
+ (ah ◦ u − bh ◦ v + ch ◦ p + dh ◦ q)j
+ (ah ◦ v + bh ◦ u − ch ◦ q + dh ◦ p)k   (5)

where ◦ denotes the element-wise multiplication between two vectors. Right-multiplication by a unit quaternion is a right-isoclinic rotation on the quaternion Qh. We can also swap Qh and W◁r and perform a left-isoclinic rotation, which does not fundamentally change the geometrical meaning.
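The two-step procedure described above — normalize the relation quaternion, rotate the head by a coordinate-wise Hamilton product, then take the quaternion inner product with the tail — can be sketched with NumPy. The function name and the (4, k) array layout are illustrative choices of ours, not the authors' implementation:

```python
import numpy as np

def quat_score(Qh, Wr, Qt):
    """QuatE-style triplet score: rotate the head by the unit relation
    quaternion, then take the quaternion inner product with the tail.
    Each argument is a (4, k) array holding the (a, b, c, d) coefficient
    vectors of a k-dimensional quaternion embedding."""
    ah, bh, ch, dh = Qh
    # Normalization step: make each coordinate of Wr a unit quaternion.
    p, q, u, v = Wr / np.sqrt((Wr ** 2).sum(axis=0, keepdims=True))
    # Rotation step: Hamilton product with element-wise multiplication.
    a = ah * p - bh * q - ch * u - dh * v
    b = ah * q + bh * p + ch * v - dh * u
    c = ah * u - bh * v + ch * p + dh * q
    d = ah * v + bh * u - ch * q + dh * p
    # Scoring step: quaternion inner product with the tail entity.
    return (np.stack([a, b, c, d]) * Qt).sum()

rng = np.random.default_rng(0)
Qh, Wr, Qt = (rng.normal(size=(4, 8)) for _ in range(3))
score = quat_score(Qh, Wr, Qt)
```

One sanity check: with the identity relation quaternion 1 + 0i + 0j + 0k in every coordinate, the rotation is a no-op and the score collapses to the plain inner product of head and tail, since rotation by a unit quaternion preserves each coordinate's norm and only changes angles.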
Isoclinic rotation is a special case of double rotation in which the angles for the two planes of rotation are equal.

Scoring Function and Loss We apply the quaternion inner product as the scoring function:

φ(h, r, t) = Q′h · Qt = ⟨a′h, at⟩ + ⟨b′h, bt⟩ + ⟨c′h, ct⟩ + ⟨d′h, dt⟩   (6)

Following Trouillon et al. [2016], we formulate the task as a classification problem, and the model parameters are learned by minimizing the following regularized logistic loss:

L(Q, W) = Σ_{r(h,t) ∈ Ω ∪ Ω−} log(1 + exp(−Yhrt φ(h, r, t))) + λ1 ‖Q‖_2^2 + λ2 ‖W‖_2^2   (7)

Here we use the ℓ2 norm with regularization rates λ1 and λ2 to regularize Q and W, respectively. Ω− is sampled from the unobserved set Ω′ using negative sampling strategies such as uniform sampling, Bernoulli sampling [Wang et al., 2014], and adversarial sampling [Sun et al., 2019]. Note that the loss function is in Euclidean space, as we take the summation over all components when computing the scoring function in Equation (6). We utilise Adagrad [Duchi et al., 2011] for optimization.

Table 1: Scoring functions of state-of-the-art knowledge graph embedding models, along with their parameters and time complexity. "⋆" denotes the circular correlation operation; "◦" denotes the Hadamard (or element-wise) product.
\u201c\u2297\" denotes Hamilton product.\n\nModel\nTransE\nHolE\nDistMult\nComplEx\nRotatE\nTorusE\nQuatE\n\nScoring Function\n\n(cid:107) (Qh + Wr) \u2212 Qt (cid:107)\n\n(cid:104)Wr, Qh (cid:63) Qt(cid:105)\n(cid:104)Wr, Qh, Qt(cid:105)\nRe((cid:104)Wr, Qh, \u00afQt(cid:105))\n(cid:107) Qh \u25e6 Wr \u2212 Qt (cid:107)\n\nmin(x,y)\u2208([Qh]+[Qh])\u00d7[Wr ] (cid:107) x \u2212 y (cid:107)\n\nQh \u2297 W (cid:47)\n\nr \u00b7 Qt\n\nParameters\n\nQh, Wr, Qt \u2208 Rk\nQh, Wr, Qt \u2208 Rk\nQh, Wr, Qt \u2208 Rk\nQh, Wr, Qt \u2208 Ck\n[Qh], [Wr], [Qt] \u2208 Tk\nQh, Wr, Qt \u2208 Hk\n\nQh, Wr, Qt \u2208 Ck,|Wri| = 1\n\nO(k log(k))\n\nOtime\nO(k)\nO(k)\nO(k)\nO(k)\nO(k)\nO(k)\n\nInitialization For parameters initilaization, we can adopt the initialization algorithm in [Parcollet\net al., 2018b] tailored for quaternion-valued networks to speed up model ef\ufb01ciency and conver-\ngence [Glorot and Bengio, 2010]. The initialization of entities and relations follows the rule:\n\nwreal = \u03d5 cos(\u03b8), wi = \u03d5Q(cid:47)\n\nimgi sin(\u03b8), wj = \u03d5Q(cid:47)\n\nimgj sin(\u03b8), wk = \u03d5Q(cid:47)\n\nimgk sin(\u03b8),\n\n(8)\n\nwhere wreal, wi, wj, wk denote the scalar and imaginary coef\ufb01cients, respectively. \u03b8 is randomly\ngenerated from the interval [\u2212\u03c0, \u03c0]. Q(cid:47)\nimg is a normalized quaternion, whose scalar part is zero. \u03d5 is\nrandomly generated from the interval [\u2212 1\u221a\n], reminiscent to the He initialization [He et al.,\n2k\n2015]. This initialization method is optional.\n\n1\u221a\n2k\n\n,\n\n4.2 Discussion\n\nTable 1 summarizes several popular knowledge graph embedding models, including scoring functions,\nparameters, and time complexities. TransE, HolE, and DistMult use Euclidean embeddings, while\nComplEx and RotatE operate in the complex space. In contrast, our model operates in the quaternion\nspace.\nCapability in Modeling Symmetry, Antisymmetry and Inversion. 
The flexibility and representational power of quaternions enable us to model the major relation patterns with ease. Similar to ComplEx, our model can model both symmetry (r(x, y) ⇒ r(y, x)) and antisymmetry (r(x, y) ⇒ ¬r(y, x)) relations. The symmetry property of QuatE can be proved by setting the imaginary parts of Wr to zero. One can easily check that the scoring function is antisymmetric when the imaginary parts are nonzero.

As for the inversion pattern (r1(x, y) ⇒ r2(y, x)), we can utilize the conjugation of quaternions. Conjugation is an involution, i.e., it is its own inverse. One can easily check that:

Qh ⊗ W◁r · Qt = Qt ⊗ W̄◁r · Qh   (9)

The detailed proofs of antisymmetry and inversion can be found in the appendix.

Composition patterns are commonplace in knowledge graphs [Lao et al., 2011, Neelakantan et al., 2015]. Both TransE and RotatE have fixed composition methods [Sun et al., 2019]. TransE composes two relations using addition (r1 + r2) and RotatE uses the Hadamard product (r1 ◦ r2). We argue that it is unreasonable to fix the composition patterns, as there might exist multiple composition patterns even in a single knowledge graph. For example, suppose there are three persons x, y, z. If y is the elder sister (denoted r1) of x and z is the elder brother (denoted r2) of y, we can easily infer that z is the elder brother of x. The relation between z and x is r2 rather than r1 + r2 or r1 ◦ r2, violating the two composition methods of TransE and RotatE. In QuatE, the composition patterns are not fixed. The relation between z and x is not only determined by the relations r1 and r2 but is also simultaneously influenced by the entity embeddings.

Connection to DistMult and ComplEx. Quaternions have more degrees of freedom compared to complex numbers. Here we show that the QuatE framework can be seen as a generalization of ComplEx.
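Both the inversion identity in Equation (9) and this reduction can be illustrated numerically. The sketch below uses scalar (k = 1) quaternions; the helper names are ours, and the snippet is a check of the algebra rather than the paper's code:

```python
import math

def hamilton(q1, q2):
    """Hamilton product of two scalar quaternions as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def conj(q):
    a, b, c, d = q
    return (a, -b, -c, -d)

def normalize(q):
    n = math.sqrt(sum(x * x for x in q))
    return tuple(x / n for x in q)

def dot(q1, q2):
    return sum(x * y for x, y in zip(q1, q2))

Qh = (0.3, -1.2, 0.7, 0.5)
Qt = (0.9, 0.4, -0.6, 1.1)
Wr = normalize((0.8, -0.1, 0.5, -0.7))

# Inversion (Eq. 9): scoring (h, r, t) equals scoring (t, conj(r), h).
lhs = dot(hamilton(Qh, Wr), Qt)
rhs = dot(hamilton(Qt, conj(Wr)), Qh)
assert abs(lhs - rhs) < 1e-12

# With the j, k coefficients zeroed, the Hamilton product collapses to
# complex multiplication, recovering the ComplEx setting.
z1, z2 = complex(0.3, -1.2), complex(0.9, 0.4)
prod = hamilton((0.3, -1.2, 0, 0), (0.9, 0.4, 0, 0))
assert abs(complex(prod[0], prod[1]) - z1 * z2) < 1e-12
assert prod[2] == 0 and prod[3] == 0
```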
If we set the coefficients of the imaginary units j and k to zero, we get complex embeddings as in ComplEx, and the Hamilton product degrades to complex number multiplication. We further remove the normalization of the relational quaternion, obtaining the following equation:

φ(h, r, t) = Qh ⊗ Wr · Qt = (ah + bhi) ⊗ (ar + bri) · (at + bti)
= [(ah ◦ ar − bh ◦ br) + (ah ◦ br + bh ◦ ar)i] · (at + bti)
= ⟨ar, ah, at⟩ + ⟨ar, bh, bt⟩ + ⟨br, ah, bt⟩ − ⟨br, bh, at⟩   (10)

where ⟨a, b, c⟩ = Σ_k ak bk ck denotes the standard component-wise multi-linear dot product. Equation (10) recovers the form of ComplEx. This framework brings another mathematical interpretation for ComplEx beyond just taking the real part of the Hermitian product. Another interesting finding is that the Hermitian product is not necessary to formulate the scoring function of ComplEx.

If we remove the imaginary parts of all quaternions and remove the normalization step, the scoring function becomes φ(h, r, t) = ⟨ah, ar, at⟩, degrading to DistMult in this case.

Table 2: Statistics of the data sets used in this paper.

Dataset | N | M | #training | #validation | #test | avg. #degree
WN18 | 40943 | 18 | 141442 | 5000 | 5000 | 3.45
WN18RR | 40943 | 11 | 86835 | 3034 | 3134 | 2.19
FB15K | 14951 | 1345 | 483142 | 50000 | 59071 | 32.31
FB15K-237 | 14541 | 237 | 272115 | 17535 | 20466 | 18.71

5 Experiments and Results

5.1 Experimental Setup

Datasets Description: We conducted experiments on four widely used benchmarks, WN18, FB15K, WN18RR and FB15K-237, whose statistics are summarized in Table 2. WN18 [Bordes et al., 2013] is extracted from WordNet3, a lexical database for the English language, where words are interlinked by means of conceptual-semantic and lexical relations.
WN18RR [Dettmers et al.,\n2018] is a subset of WN18, with inverse relations removed. FB15K [Bordes et al., 2013] contains\nrelation triples from Freebase, a large tuple database with structured general human knowledge.\nFB15K-237 [Toutanova and Chen, 2015] is a subset of FB15K, with inverse relations removed.\nEvaluation Protocol: Three popular evaluation metrics are used, including Mean Rank (MR), Mean\nReciprocal Rank (MRR), and Hit ratio with cut-off values n = 1, 3, 10. MR measures the average\nrank of all correct entities with a lower value representing better performance. MRR is the average\ninverse rank for correct entities. Hit@n measures the proportion of correct entities in the top n entities.\nFollowing Bordes et al. [2013], \ufb01ltered results are reported to avoid possibly \ufb02awed evaluation.\nBaselines: We compared QuatE with a number of strong baselines. For Translational Distance\nModels, we reported TransE [Bordes et al., 2013] and two recent extensions, TorusE [Ebisu and Ichise,\n2018] and RotatE [Sun et al., 2019]; For Semantic Matching Models, we reported DistMult [Yang\net al., 2014], HolE [Nickel et al., 2016], ComplEx [Trouillon et al., 2016] , SimplE [Kazemi and\nPoole, 2018], ConvE [Dettmers et al., 2018], R-GCN [Schlichtkrull et al., 2018], and KNGE (ConvE\nbased) [Wang et al., 2018].\nImplementation Details: We implemented our model using pytorch4 and tested it on a single GPU.\nThe hyper-parameters are determined by grid search. The best models are selected by early stopping\non the validation set. In general, the embedding size k is tuned amongst {50, 100, 200, 250, 300}.\nRegularization rate \u03bb1 and \u03bb2 are searched in {0, 0.01, 0.05, 0.1, 0.2}. Learning rate is \ufb01xed to\n0.1 without further tuning. The number of negatives (#neg) per training sample is selected from\n{1, 5, 10, 20}. We create 10 batches for all the datasets. 
For most baselines, we report the results in\nthe original papers, and exceptions are provided with references. For RotatE (without self-adversarial\nnegative sampling), we use the best hyper-parameter settings provided in the paper to reproduce the\nresults. We also report the results of RotatE with self-adversarial negative sampling and denote it as\na-RotatE. Note that we report three versions of QuatE: including QuatE with/without type constraints,\nQuatE with N3 regularization and reciprocal learning. Self-adversarial negative sampling [Sun et al.,\n2019] is not used for QuatE. All hyper-parameters of QuatE are provided in the appendix.\n\n3https://wordnet.princeton.edu/\n4https://pytorch.org/\n\n6\n\n\fTable 3: Link prediction results on WN18 and FB15K. Best results are in bold and second best\nresults are underlined. [\u2020]: Results are taken from [Nickel et al., 2016]; [(cid:5)]: Results are taken\nfrom [Kadlec et al., 2017]; [\u2217]: Results are taken from [Sun et al., 2019]. a-RotatE denotes RotatE\nwith self-adversarial negative sampling. 
[QuatE1]: without type constraints; [QuatE2]: with N3\nregularization and reciprocal learning; [QuatE3]: with type constraints.\n\nModel\nTransE\u2020\nDistMult(cid:5)\n\nHolE\n\nComplEx\nConvE\nR-GCN+\nSimplE\nNKGE\nTorusE\nRotatE\na-RotatE\u2217\nQuatE1\nQuatE2\nQuatE3\n\nMR MRR\n0.495\n0.797\n0.938\n0.941\n0.943\n0.819\n0.942\n0.947\n0.947\n0.947\n0.949\n0.949\n0.950\n0.950\n\n-\n655\n-\n-\n374\n-\n-\n336\n-\n184\n309\n388\n-\n162\n\nWN18\n\nFB15K\n\n-\n\n-\n\n-\n\n-\n\n-\n\n42.2\n\n0.888\n\n0.578\n\n0.113\n\nHit@10 Hit@3 Hit@1 MR MRR Hit@10 Hit@3 Hit@1\n0.943\n0.297\n0.946\n0.949\n0.947\n0.956\n0.964\n0.947\n0.957\n0.954\n0.961\n0.959\n0.960\n0.962\n0.959\n\n0.463\n0.798\n0.524\n0.692\n0.657\n0.696\n0.727\n0.73\n0.733\n0.699\n0.797\n0.770\n0.833\n0.782\n\n0.749\n0.893\n0.739\n0.840\n0.831\n0.842\n0.838\n0.871\n0.832\n0.872\n0.884\n0.878\n0.900\n0.900\n\n0.599\n0.599\n0.558\n0.601\n0.660\n0.650\n0.674\n0.585\n0.746\n0.700\n0.800\n0.711\n\n0.759\n0.759\n0.723\n0.760\n0.773\n0.790\n0.771\n0.788\n0.830\n0.821\n0.859\n0.835\n\n0.930\n0.936\n0.935\n0.697\n0.939\n0.942\n0.943\n0.938\n0.944\n0.941\n0.944\n0.945\n\n0.945\n0.945\n0.946\n0.929\n0.944\n0.949\n0.950\n0.953\n0.952\n0.954\n0.954\n0.954\n\n-\n-\n51\n-\n-\n56\n-\n32\n40\n41\n-\n17\n\nTable 4: Link prediction results on WN18RR and FB15K-237. 
[\u2020]: Results are taken from [Nguyen\net al., 2017]; [(cid:5)]: Results are taken from [Dettmers et al., 2018]; [\u2217]: Results are taken from [Sun\net al., 2019].\n\nMR MRR\n0.226\n3384\n0.43\n5110\n0.44\n5261\n4187\n0.43\n\n4170\n3277\n3340\n3472\n\n-\n\n-\n\n2314\n\n-\n\n0.45\n0.470\n0.476\n0.481\n0.482\n0.488\n\n-\n\n-\n\n0.44\n0.46\n0.44\n\n0.39\n0.41\n0.40\n\nWN18RR\nHit@10 Hit@3 Hit@1 MR MRR\n0.294\n0.501\n0.241\n0.49\n0.247\n0.51\n0.325\n0.52\n0.249\n0.33\n0.297\n0.338\n0.311\n0.366\n0.348\n\n357\n254\n339\n244\n-\n237\n185\n177\n176\n-\n87\n\n0.465\n0.488\n0.492\n0.500\n0.499\n0.508\n\n0.421\n0.422\n0.428\n0.436\n0.436\n0.438\n\n0.526\n0.565\n0.571\n0.564\n0.572\n0.582\n\n-\n\n-\n\n-\n\n-\n\n-\n\nFB15K-237\nHit@10 Hit@3 Hit@1\n0.465\n0.419\n0.428\n0.501\n0.417\n0.510\n0.480\n0.533\n0.495\n0.556\n0.550\n\n0.155\n0.158\n0.237\n0.151\n0.241\n0.205\n0.241\n0.221\n0.271\n0.248\n\n0.263\n0.275\n0.356\n0.264\n0.365\n0.328\n0.375\n0.342\n0.401\n0.382\n\nModel\nTransE \u2020\nDistMult(cid:5)\nComplEx(cid:5)\nConvE(cid:5)\nR-GCN+\nNKGE\nRotatE\u2217\na-RotatE\u2217\nQuatE1\nQuatE2\nQuatE3\n\n5.2 Results\n\nTable 5: MRR for the models tested\non each relation of WN18RR.\n\nThe empirical results on four datasets are reported in Table 3\nand Table 4. QuatE performs extremely competitively com-\npared to the existing state-of-the-art models across all metrics.\nAs a quaternion-valued method, QuatE outperforms the two\nrepresentative complex-valued models ComplEx and RotatE.\nThe performance gains over RotatE also con\ufb01rm the advantages\nof quaternion rotation over rotation in the complex plane.\nOn the WN18 dataset, QuatE outperforms all the baselines on\nall metrics except Hit@10. R-GCN+ achieves high value on\nHit@10, yet is surpassed by most models on the other four\nmetrics. The four recent models NKGE, TorusE, RotaE, and\na-RotatE achieves comparable results. 
QuatE also achieves the\nbest results on the FB15K dataset, while the second best results scatter amongst RotatE, a-RotatE and\nDistMult. We are well-aware of the good results of DistMult reported in [Kadlec et al., 2017], yet\nthey used a very large negative sampling size (i.e., 1000, 2000). The results also demonstrate that\n\nRelation Name\nhypernym\nderivationally_related_form\ninstance_hypernym\nalso_see\nmember_meronym\nsynset_domain_topic_of\nhas_part\nmember_of_domain_usage\nmember_of_domain_region\nverb_group\nsimilar_to\n\nQuatE3\n0.173\n0.953\n0.364\n0.629\n0.232\n0.468\n0.233\n0.441\n0.193\n0.924\n1.000\n\nRotatE\n0.148\n0.947\n0.318\n0.585\n0.232\n0.341\n0.184\n0.318\n0.200\n0.943\n1.000\n\n7\n\n\fTable 7: Analysis on different variants of scoring function. Same hyperparameters settings as QuatE3\nare used.\n\nWN18\n\nFB15K\n\nWN18RR\n\nFB15K-237\n\nAnalysis\nQh \u2297 Wr \u00b7 Qt\nWr \u00b7 (Qh \u2297 Qt)\n(Qh \u2297 W (cid:47)\n\nr ) \u00b7 (Qt \u2297 V (cid:47)\nr )\n\nMRR Hit@10 MRR Hit@10 MRR Hit@10 MRR Hit@10\n0.936\n0.463\n0.446\n0.784\n0.947\n0.539\n\n0.415\n0.401\n0.477\n\n0.272\n0.263\n0.344\n\n0.951\n0.945\n0.958\n\n0.482\n0.471\n0.563\n\n0.866\n0.809\n0.889\n\n0.686\n0.599\n0.787\n\nQuatE can effectively capture the symmetry, antisymmetry and inversion patterns since they account\nfor a large portion of the relations in these two datasets.\nAs shown in Table 4, QuatE achieves a large performance gain over existing state-of-the-art models\non the two datasets where trivial inverse relations are removed. On WN18RR in which there are a\nnumber of symmetry relations, a-RotatE is the second best, while other baselines are relatively weaker.\nThe key competitors on the dataset FB15K-237 where a large number of composition patterns exist\nare NKGE and a-RotatE. Table 5 summarizes the MRR for each relation on WN18RR, con\ufb01rming\nthe superior representation capability of quaternion in modelling different types of relation. 
Methods with fixed composition patterns, such as TransE and RotatE, are relatively weak at times.
We can also apply the N3 regularization and reciprocal learning approaches of [Lacroix et al., 2018] to QuatE. The results are shown in Table 3 and Table 4 as QuatE2. We observe that N3 regularization and reciprocal learning boost performance considerably, especially on FB15K and FB15K-237. Since N3 regularization already reduces the norms of the entity and relation embeddings, we do not apply relation normalization in this setting. However, like the method in [Lacroix et al., 2018], QuatE2 requires a large embedding dimension.

5.3 Model Analysis

Number of Free Parameters Comparison. Table 6 compares the number of free parameters of QuatE1 against two recent competitive baselines: RotatE and TorusE. Note that QuatE3 uses almost the same number of free parameters as QuatE1. TorusE uses a very large embedding dimension of 10000 for both WN18 and FB15K; this number is even close to the number of entities in FB15K, which we consider undesirable, since our original intention is to embed entities and relations into a lower-dimensional space. QuatE greatly reduces the parameter size of its complex-valued counterpart RotatE. The reduction is more significant on the datasets without trivial inverse relations, saving up to 80% of the parameters while maintaining superior performance.

Table 6: Number of free parameters comparison (percentages are relative to RotatE).

Model        TorusE     RotatE    QuatE1
Space        T^k        C^k       H^k
WN18         409.61M    40.95M    49.15M (↑ 20.0%)
FB15K        162.96M    31.25M    26.08M (↓ 16.5%)
WN18RR       -          40.95M    16.38M (↓ 60.0%)
FB15K-237    -          29.32M    5.82M  (↓ 80.1%)

Ablation Study on Quaternion Normalization. We remove the normalization step in QuatE and use the original (unnormalized) relation quaternion Wr to project the head entity. From Table 7, we clearly observe that normalizing the relation to a unit quaternion is a critical step for the embedding performance.
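To make the ablation concrete, the Hamilton product, the normalization of Wr to a unit quaternion Wr◁, and the scoring function φ(h, r, t) = Qh ⊗ Wr◁ · Qt can be sketched in a few lines of NumPy (our own illustration, not the authors' released code; the array layout is an assumption):

```python
import numpy as np


def hamilton(q, p):
    """Component-wise Hamilton product of two quaternion embeddings.

    q, p: arrays of shape (4, k) holding the real part and the three
    imaginary parts (i, j, k) of k quaternion coordinates each.
    """
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = p
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    ])


def normalize(w):
    """Project each relation quaternion onto the unit sphere (Wr -> Wr◁)."""
    return w / np.linalg.norm(w, axis=0, keepdims=True)


def score(q_h, w_r, q_t):
    """phi(h, r, t) = Qh ⊗ Wr◁ · Qt: rotate the head by the unit relation
    quaternion, then take the quaternion inner product with the tail."""
    return float(np.sum(hamilton(q_h, normalize(w_r)) * q_t))
```

Because the quaternion norm is multiplicative, the Hamilton product with the unit quaternion Wr◁ is a pure rotation and preserves the norm of Qh; the unnormalized variant Qh ⊗ Wr · Qt in Table 7 additionally rescales the head embedding, which is the effect this ablation isolates.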
This is likely because non-unit quaternions introduce a scaling effect that is detrimental to performance.
Hamilton Products between Head and Tail Entities. We reformulate the scoring function of QuatE following the original formulation of ComplEx: we take the Hamilton product between the head and tail quaternions and treat the relation quaternion as a weight, i.e., φ(h, r, t) = Wr · (Qh ⊗ Qt). As a result, the geometric property of relational rotation is lost, which leads to the poor performance shown in Table 7.
Additional Rotational Quaternion for Tail Entity. We hypothesize that applying an additional relation quaternion to the tail entity might give the model more representation capability. We therefore revise the scoring function to (Qh ⊗ Wr◁) · (Qt ⊗ Vr◁), where Vr represents the rotational quaternion for the tail entity. From Table 7, we observe that this variant achieves competitive results without extensive tuning. However, it might incur some loss of efficiency.

6 Conclusion

In this paper, we design a new knowledge graph embedding model that operates in the quaternion space with well-defined mathematical and physical meaning. Our model is advantageous in its capability to model several key relation patterns, its expressiveness afforded by higher degrees of freedom, and its good generalization. Empirical evaluations on four well-established datasets show that QuatE achieves an overall state-of-the-art performance, outperforming multiple recent strong baselines with even fewer free parameters.

Acknowledgments

This research was partially supported by grant ONRG NICOP N62909-19-1-2009.

References

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel.
Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–610. ACM, 2014.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Takuma Ebisu and Ryutaro Ichise. TorusE: Knowledge graph embedding on a Lie group. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Chase J Gaudet and Anthony S Maida. Deep quaternion networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

William Rowan Hamilton. LXXVIII. On quaternions; or on a new system of imaginaries in algebra: To the editors of the Philosophical Magazine and Journal. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 25(169):489–495, 1844.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 687–696, 2015.

Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. ACL 2017, page 69, 2017.

Seyed Mehran Kazemi and David Poole. Simple embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pages 4289–4300, 2018.

Timothee Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, pages 2869–2878, 2018.

Ni Lao, Tom Mitchell, and William W Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. arXiv preprint arXiv:1504.06662, 2015.

Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. A novel embedding model for knowledge base completion based on convolutional neural network. arXiv preprint arXiv:1712.02121, 2017.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

T. Parcollet, M. Morchid, P. Bousquet, R.
Dufour, G. Linarès, and R. De Mori. Quaternion neural networks for spoken language understanding. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 362–368, Dec 2016. doi: 10.1109/SLT.2016.7846290.

Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio. Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789.

Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Quaternion denoising encoder-decoder for theme identification of telephone conversations. In INTERSPEECH, 2017.

Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Quaternion convolutional neural networks for heterogeneous image processing. CoRR, abs/1811.02656, 2018a. URL http://arxiv.org/abs/1811.02656.

Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, and Yoshua Bengio. Quaternion recurrent neural networks. The International Conference on Learning Representations, abs/1806.04418, 2018b.

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.

Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In The Seventh International Conference on Learning Representations, 2019.

Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference.
In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.

Théo Trouillon and Maximilian Nickel. Complex and holographic embeddings of knowledge graphs: a comparison. arXiv preprint arXiv:1707.01475, 2017.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.

Kai Wang, Yu Liu, Xiujuan Xu, and Dan Lin. Knowledge graph embedding with entity neighbors and deep memory network. arXiv preprint arXiv:1808.03752, 2018.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.