{"title": "Expanding Holographic Embeddings for Knowledge Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 4491, "page_last": 4501, "abstract": "Neural models operating over structured spaces such as knowledge graphs require a continuous embedding of the discrete elements of this space (such as entities) as well as the relationships between them. Relational embeddings with high expressivity, however, have high model complexity, making them computationally difficult to train. We propose a new family of embeddings for knowledge graphs that interpolate between a method with high model complexity and one, namely Holographic embeddings (HolE), with low dimensionality and high training efficiency. This interpolation, termed HolEx, is achieved by concatenating several linearly perturbed copies of original HolE. We formally characterize the number of perturbed copies needed to provably recover the full entity-entity or entity-relation interaction matrix, leveraging ideas from Haar wavelets and compressed sensing. In practice, using just a handful of Haar-based or random perturbation vectors results in a much stronger knowledge completion system. On the Freebase FB15K dataset, HolEx outperforms originally reported HolE by 14.7\\% on the HITS@10 metric, and the current path-based state-of-the-art method, PTransE, by 4\\% (absolute).", "full_text": "Expanding Holographic Embeddings\n\nfor Knowledge Completion\n\nYexiang Xue(cid:63)\n\nYang Yuan\u2020\n\nZhitian Xu(cid:63)\n\nAshish Sabharwal\u2021\n\n(cid:63) Dept. of Computer Science, Purdue University, West Lafayette, IN, USA\n\n\u2020 Dept. 
of Computer Science, Cornell University, Ithaca, NY, USA\n\u2021 Allen Institute for Arti\ufb01cial Intelligence (AI2), Seattle, WA, USA\n\nAbstract\n\nNeural models operating over structured spaces such as knowledge graphs require\na continuous embedding of the discrete elements of this space (such as entities)\nas well as the relationships between them. Relational embeddings with high\nexpressivity, however, have high model complexity, making them computationally\ndif\ufb01cult to train. We propose a new family of embeddings for knowledge graphs\nthat interpolate between a method with high model complexity and one, namely\nHolographic embeddings (HOLE), with low dimensionality and high training\nef\ufb01ciency. This interpolation, termed HOLEX, is achieved by concatenating several\nlinearly perturbed copies of original HOLE. We formally characterize the number of\nperturbed copies needed to provably recover the full entity-entity or entity-relation\ninteraction matrix, leveraging ideas from Haar wavelets and compressed sensing. In\npractice, using just a handful of Haar-based or random perturbation vectors results\nin a much stronger knowledge completion system. On the Freebase FB15K dataset,\nHOLEX outperforms originally reported HOLE by 14.7% on the HITS@10 metric,\nand the current path-based state-of-the-art method, PTransE, by 4% (absolute).\n\n1\n\nIntroduction\n\nRelations, as a key concept in arti\ufb01cial intelligence and machine learning, allow human beings as\nwell as intelligent systems to learn and reason about the world. In particular, relations among multiple\nentities and concepts enable us to make logical inference, learn new concepts, draw analogies, make\ncomparisons, etc. This paper considers relational learning for knowledge graphs (KGs), which often\ncontain knowledge in the form of binary relations, such as livesIn(Bill Gates, Seattle). 
A number of\nvery large KGs, with millions and even billions of facts, have become prominent in the last decade,\nsuch as Freebase [3], DBpedia [2], YAGO [11], WordNet [17], and WebChild [26].\nA KG can be represented as a multigraph, where entities such as Bill Gates and Seattle are nodes,\nconnected with zero or more relations such as livesIn and likes. Facts such as livesIn(Bill Gates,\nSeattle) form typed edges, with the relation\u2014in this case livesIn\u2014being the edge type. In particular,\nwe are interested in the knowledge completion task for KGs: Given an existing KG, we would like to\nuse statistical machine learning tools to extract correlations among its entities and relations, and use\nthese correlations to derive new knowledge about them.\nCompositional vector space models, also referred to as matrix or tensor factorization based methods,\nhave proven to be highly effective for KG completion [e.g., 4, 5, 7, 8, 12, 14\u201316, 18, 19, 23\u201325, 28].\nIn these models, entities and relations are represented as (learned) vectors in a high dimensional space,\nand various forms of compositional operators are used to determine the likelihood of a candidate fact.\nA good design of the compositional operator is often key to the success of the model. Such design\nmust balance computational complexity against model complexity. Not surprisingly, embedding\nmodels capable of capturing rich correlations in relational data often have limited computational\nscalability. On the other hand, models that can be trained ef\ufb01ciently are often less expressive.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWe focus on two compositional operators. The \ufb01rst is the full tensor product [18], which captures\ncorrelations between every pair of dimensions of two embedding vectors in Rd, by considering their\nouter product. 
The resulting quadratic (d^2) parameter space makes this impractical even for modestly\nsized KGs. The second is the circular correlation underlying holographic embedding or HOLE [19],\nwhich is inspired by holographic models of associative memory. Notably, HOLE keeps the parameter\nspace linear in d by capturing only the sum along each diagonal of the full tensor product matrix.\nOur main contribution is a new compositional operator that combines the strengths of these two\nmodels, resulting in a much stronger knowledge completion system. Speci\ufb01cally, we propose expanded\nholographic embeddings or HOLEX, which is a collection of models that interpolates between\nholographic embeddings and the full tensor product.\nThe idea is to concatenate l \u2265 1 copies of the HOLE model, each perturbed by a linear vector,\nallowing various copies to focus on different subspaces of the embedding. HOLEX forms a complete\nspectrum connecting HOLE with the full tensor product model: it falls back to HOLE when l = 1\nand all entries in the perturbation vector are non-zero, and is equivalent to the full tensor product\nmodel when l = d, the embedding dimension, and all perturbation vectors are linearly independent.\nWe consider two families of perturbation vectors, low frequency Haar wavelets [6, 10] and random\n0/1 vectors. We show that using the former corresponds to considering sums of multiple subsequences\nof each diagonal line of the full product matrix, in contrast to the original holographic embedding,\nwhich sums up the entire diagonal. We \ufb01nd that even just a few low frequency vectors in the Haar\nmatrix are quite effective in practice for HOLEX. When using the complete Haar matrix, the length\nof the subsequences becomes one, thereby recovering the full tensor product case. Our second family\nof perturbation vectors, namely random 0/1 vectors, corresponds to randomly sub-selecting half the\nrows of the tensor product matrix in each copy. 
This is valuable when the full product matrix is\nsparse. Speci\ufb01cally, using techniques from compressed sensing, if each diagonal line is dominated\nby a few large entries (in terms of absolute values), we show that a logarithmic number of random\nvectors suf\ufb01ce to recover information from these large entries.\nTo assess its ef\ufb01cacy, we implement HOLEX using the framework of ProjE [23], a recent neural\nmethod developed for the Freebase FB15K knowledge completion dataset [3, 5], where the 95%\ncon\ufb01dence interval for statistical signi\ufb01cance is 0.3%. In terms of the standard HITS@10 metric,\nHOLEX using 16 random 0/1 vectors outperforms the original HOLE by 14.7% (absolute), ProjE by\n5.7%, and a path-based state-of-the-art method by 4%.\n\n2 Preliminaries\n\nWe use knowledge graphs to predict new relations between entities. For example, given entities\nAlbany and New York State, possible relationships between these two entities are CityIn and\nCapitalOf. Formally, let E denote the set of all entities in a KG G. A relation r is a subset of E \u00d7 E,\ncorresponding to all entity pairs that satisfy the relation. For example, the relation CapitalOf contains\nall (City, State) pairs in which the City is the capital of that particular State. For each relation r, we\nwould like to learn the characterization function for r, \u03c6r(s, o), which evaluates to +1 if the entity\npair (s, o) is in the relation set, and to -1 otherwise. Notice that s and o are typically asymmetrical. For\nexample, Albany is the capital of New York State, but not the other way around. Relations can be\nvisualized as a knowledge graph, where the nodes represent entities, and one relation corresponds to\na set of edges connecting entity pairs with the given relation.\nAs mentioned earlier, compositional embeddings are useful models for prediction in knowledge\ngraphs. 
Generally speaking, these models embed entities as well as relations jointly into a high\ndimensional space. Let s \u2208 R^{ds}, o \u2208 R^{do}, r \u2208 R^{dr} be the embeddings for entities s and o, and the\nrelation r, respectively. Compositional embeddings learn a score function \u03c3(.) that approximates the\nposterior probability of \u03c6r(s, o) conditioned on the dataset \u2126:\n\nPr (\u03c6r(s, o) = 1 | \u2126) = \u03c3(s, o, r).\n\n(1)\n\nMany models have been proposed with different functional forms for \u03c3 [e.g., 4, 5, 8, 12, 15, 16,\n18, 19, 23\u201325, 28]. A crucial part of these models is the compositional operators they use to\n\n\fcapture the correlation between entities and relations. Given entities (and/or relations) embeddings\na = (a0, . . . , a_{da\u22121})\u2032 \u2208 R^{da} and b = (b0, . . . , b_{db\u22121})\u2032 \u2208 R^{db}, a compositional operator is a function\nf : R^{da} \u00d7 R^{db} \u2192 R^{df}, which maps a and b into another high dimensional space1. Such operators\nare used to combine the information from the embeddings of entities and relations to predict the\nlikelihood of a particular entity-relation tuple in the score function. A good compositional operator\nnot only extracts information effectively from a and b, but also trades it off with model complexity.\nOne approach is to use vector arithmetic operations, such as (weighted) vector addition and\nsubtraction used by TransE [5], TransH [28], and ProjE [23]. One drawback of this approach is that\nthe embedding dimensions remain independent in such vector operations, preventing the model from\ncapturing rich correlations across different dimensions. 
Another popular compositional operator is to\nconcatenate the embeddings of relations and entities, and later apply a non-linear activation function\nto implicitly capture correlations [8, 24].\nGiven the importance of capturing rich correlations, we focus on two representative compositional\noperators that explicitly model the correlations among entities and relations: the full tensor product\nand the holographic embedding, described below.\nFull Tensor Product Many models, such as RESCAL [18] and its compositional training extension [9] and Neural Tensor Network [25], take the full tensor product as the compositional operator.\nGiven two embedding vectors a, b \u2208 Rd, the full tensor product is de\ufb01ned as a \u2297 b = ab^T, i.e.,\n\n[a \u2297 b]i,j = ai bj.\n\n(2)\n\nThe full tensor product captures all pairwise multiplicative interactions between a and b. Intuitively,\na feature in a \u2297 b is \u201con\u201d (with large absolute value), if and only if the corresponding features in\nboth a and b are \u201con\u201d. This helps entities with multiple characteristics. For example, consider an\nentity Obama, who is a man, a basketball player, and a former president of the US. In the embeddings\nfor Obama, we can have one dimension \ufb01ring up when it is coupled with Chicago Bulls (basketball\nteam), but a different dimension \ufb01ring up when coupled with the White House.\nHowever, this rich expressive power comes at a cost: a huge parameter space, which makes it dif\ufb01cult,\nif not impossible, to effectively train a model on large datasets. For example, for RESCAL, the score\nfor a triple (s, r, o) is de\ufb01ned as:\n\n\u03c3(s, o, r) = grandsum((s \u2297 o) \u25e6 Wr)\n\n(3)\n\nwhere Wr \u2208 Rd\u00d7d is the matrix encoding for relation r, \u25e6 refers to the Hadamard product (i.e., the\nelement-wise product), and grandsum refers to the sum of all entries of a matrix. 
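The RESCAL-style score in Eq. (3) can be sketched in a few lines of numpy; note that grandsum((s \u2297 o) \u25e6 Wr) is just the bilinear form s\u1d40 Wr o written element-wise (variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
s, o = rng.normal(size=d), rng.normal(size=d)
W_r = rng.normal(size=(d, d))  # one d x d parameter block per relation

# Eq. (3): grandsum of the Hadamard product of the full tensor product
# with the relation matrix ...
score_grandsum = np.sum(np.outer(s, o) * W_r)
# ... which is exactly the bilinear form s^T W_r o.
score_bilinear = s @ W_r @ o
assert np.isclose(score_grandsum, score_bilinear)
```

The d x d block `W_r` per relation is precisely the quadratic parameter cost discussed next.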
With |R| relations,\nthe number of parameters is dominated by the embedding of all relations, totaling d^2|R|. This quickly\nbecomes infeasible even for modestly sized knowledge graphs.\nHolographic Embedding HOLE provides an alternative compositional operator using the idea of\ncircular correlation. Given a, b \u2208 Rd, the holographic compositional operator h : Rd \u00d7 Rd \u2192 Rd\nproduces an interaction vector of the same dimension as a and b, with the k-th dimension being:\n\nhk(a, b) = [a \u22c6 b]k = \u2211_{i=0}^{d\u22121} ai b(i+k) mod d.\n\n(4)\n\nFigure 1 (left) provides a graphical illustration. HOLE computes the sum of each (circular) diagonal\nline of the original tensor product matrix, collapsing a two-dimensional matrix into a one-dimensional\nvector. Put another way, HOLE still captures pairwise interactions between different dimensions of a\nand b, but collapses everything along each individual diagonal and retains only the sum for each\nsuch \u2018bucket\u2019. The HOLE score for a triple (s, r, o) is de\ufb01ned as:\n\n\u03c3(s, o, r) = (s \u22c6 o) \u00b7 r\n\n(5)\n\nwhere \u00b7 denotes dot-product. This requires only d|R| parameters for encoding all relations.2\n\n1A composition operator can be between two entities, or between an entity and a relation.\n2As discussed later, our improved reimplementation of HOLE, inspired by recent work [23], uses a slight\nvariation for the tail-prediction (a.k.a. object-prediction) task, namely \u03c3(s, o, r) = [s \u22c6 r] \u00b7 o. Analogously for\nhead-prediction (a.k.a. subject-prediction).\n\n\fFigure 1: (Left) Visualization of HOLE, which collapses the full tensor product M = ab\u2032 into a\nvector by summing up along each (circular) diagonal line, depicted with the same color. (Middle)\nHOLE perturbed with a vector c, where each row of M is multiplied with one entry in c prior to the\nholographic operation. (Right) HOLEX using the \ufb01rst two Haar vectors. 
When M has dimension\nd \u00d7 d, this is equivalent to returning a 2 \u00d7 d matrix that sums up along each half of each diagonal\nline, depicted by the same color.\n\nThe circular correlation used in Holographic embedding can be seen as a projection of the full tensor\nproduct by weighting all interactions the same along each diagonal line. Given its similarity to\n(circular) convolution, the actual computation can be carried out ef\ufb01ciently with the fast Fourier\ntransformation (FFT): h(a, b) = F\u22121(conj(F(a)) \u25e6 F(b)), where F is the discrete Fourier transform and\nconj(x) represents the complex conjugate of x. As before, \u25e6 refers to element-wise product.\n\n3 Expanding Holographic Embeddings\n\nIs there a model that sits in between HOLE and the full tensor product, and provides a better trade-off\nthan either extreme between computational complexity and model complexity? We present Expanded\nHolographic Embeddings or HOLEX, which is a collection of models with increasing complexity\nthat provides a controlled way to interpolate between HOLE and the full tensor product. Given a \ufb01xed vector\nc \u2208 Rd, we de\ufb01ne the perturbed holographic compositional operator for a, b \u2208 Rd as:\n\nh(a, b; c) = (c \u25e6 a) \u22c6 b.\n\n(6)\n\nAs before, \u25e6 represents the Hadamard product and the score for a triple (s, r, o) is computed by taking\nthe dot product of this composition between two elements (e.g., s and o) and a d-dimensional vector\nencoding the third element (e.g., r). In other words, the k-th dimension of h now becomes:\n\nhk(a, b; c) = [(c \u25e6 a) \u22c6 b]k = \u2211_{i=0}^{d\u22121} ci ai b(i+k) mod d.\n\n(7)\n\nIn practice, vector c is chosen prior to training. As depicted in Figure 1 (middle), HOLEX visually\n\ufb01rst forms the full tensor product of a and b, then multiplies each row with the corresponding\ndimension in c, and \ufb01nally sums up along each (circular) diagonal line. 
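Eq. (7) can be checked directly against an FFT-based computation, obtained by applying HolE's FFT identity from Section 2 to the rescaled vector c \u25e6 a (a minimal numpy sketch, not the paper's implementation; function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
a, b, c = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

# Eq. (7): k-th entry of the perturbed circular correlation, computed directly.
def holex_direct(a, b, c):
    d = len(a)
    return np.array([sum(c[i] * a[i] * b[(i + k) % d] for i in range(d))
                     for k in range(d)])

# Same operator via FFT: HolE's identity h(a, b) = F^{-1}(conj(F(a)) * F(b)),
# applied to the rescaled vector (c * a).
def holex_fft(a, b, c):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(c * a)) * np.fft.fft(b)))

assert np.allclose(holex_direct(a, b, c), holex_fft(a, b, c))
```

With c = (1, ..., 1) both functions reduce to the original HOLE operator of Eq. (4), matching the observation that a single all-non-zero perturbation vector adds nothing.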
Computationally, HOLEX continues to bene\ufb01t from the use of the fast Fourier transform: h(a, b; c) = F\u22121(conj(F(c \u25e6 a)) \u25e6 F(b)).\n\nOn one hand, HOLEX falls back to HOLE if we only use one perturbation vector with all non-zero\nentries. This is because one can always rescale a to subsume the effect of c.\nOn the other hand, we can expand HOLE to more complex models by using multiple perturbation\nvectors. Suppose we have l vectors c0, . . . , cl\u22121. The rank-l HOLEX is de\ufb01ned as the concatenation\nof the perturbed holographic embeddings induced by c0, . . . , cl\u22121, i.e.,\n\nh(a, b; c0, . . . , cl\u22121) = [h(a, b; c0), h(a, b; c1), . . . , h(a, b; cl\u22121)].\n\n(8)\n\nFor simplicity of notation, let matrix Cl denote (c0, . . . , cl\u22121) and write h(a, b; Cl) to represent\nh(a, b; c0, . . . , cl\u22121). Treating each h(a, b; ci) as a column vector, the entire expanded embedding,\nh(a, b; Cl), is a d \u00d7 l matrix. For the tail-prediction task (analogously for head-prediction), the \ufb01nal\nrank-l HOLEX score for a triple (s, r, o) is de\ufb01ned as:\n\n\u03c3(s, r, o) = \u2211_{j=0}^{l\u22121} h(s, r; cj) \u00b7 o.\n\n(9)\n\n\fImportantly, this expanded embedding has the same number of parameters as HOLE itself.3\nWe start with a basic question: Does rank-l HOLEX capture more information than rank-l\u2032 when\nl > l\u2032? The answer is af\ufb01rmative if c0, . . . , cl\u22121 are linearly independent. In fact, Theorem 1\nshows that under this setting, rank-d HOLEX is equivalent to the full tensor product up to a linear\ntransformation.\nTheorem 1. Let a, b \u2208 Rd, l = d, and R be the full tensor product matrix arranged\naccording to diagonal lines, i.e., Ri,j = ai b(i+j) mod d. Then rank-d HOLEX satis\ufb01es:\n\nh(a, b; Cd) = R^T Cd.\n\nNote that this linear transformation is invertible if Cd has full rank. 
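Theorem 1 is easy to verify numerically (a sketch under the assumption that random Gaussian vectors serve as a full-rank Cd; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
a, b = rng.normal(size=d), rng.normal(size=d)

# Full tensor product rearranged along (circular) diagonals:
# R[i, j] = a_i * b_{(i+j) mod d}.
R = np.array([[a[i] * b[(i + j) % d] for j in range(d)] for i in range(d)])

def holex(a, b, c):
    # One perturbed copy h(a, b; c), via the FFT identity.
    return np.real(np.fft.ifft(np.conj(np.fft.fft(c * a)) * np.fft.fft(b)))

# Rank-d expansion with d (almost surely linearly independent) random vectors.
C = rng.normal(size=(d, d))                        # columns are c_0 ... c_{d-1}
H = np.column_stack([holex(a, b, C[:, j]) for j in range(d)])

# Theorem 1: h(a, b; C_d) = R^T C_d, so R is recoverable when C_d is invertible.
assert np.allclose(H, R.T @ C)
assert np.allclose(H @ np.linalg.inv(C), R.T)
```

The second assertion is the invertibility remark in action: multiplying by Cd\u207b\u00b9 recovers the full interaction matrix from the expanded embedding.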
In other words, learning a rank-d\nexpanded holographic embedding is equivalent to learning the full tensor product. As an example,\nconsider the RESCAL model with score function r^T(s \u2297 o). This is an inner product between the\nrelation embedding r and the full tensor product matrix (s \u2297 o) between the subject and object\nentities. Suppose we replace the tensor product matrix (s \u2297 o) with the full expanded holographic\nembedding h(s, o; Cd), obtaining a new model r^T h(s, o; Cd). Theorem 1 states that the original\ntensor product matrix (s \u2297 o) is connected to h(s, o; Cd) via a linear transformation, making the\ntwo embedding models, r^T(s \u2297 o) and r^T h(s, o; Cd), essentially equivalent.\n\n3.1 Low Rank Holographic Expansions\n\nTheorem 1 states that if we could afford d perturbation vectors, then HOLEX is equivalent to the full\ntensor product (RESCAL) matrix. What happens if we cannot afford all d perturbation vectors? We\nwill see that, in this case, HOLEX forms a collection of models with increasingly richer representation\npower and correspondingly higher computational needs. Our goal is to choose a family of perturbation\nvectors that provides a substantial bene\ufb01t even if only a handful of vectors are used in HOLEX.\nDifferent choices of linearly independent families of perturbation vectors extract different information\nfrom the full RESCAL matrix, thereby leading to different empirical performance. For example,\nconsider using the truncated identity matrix Ik\u00d7d, i.e., the \ufb01rst k columns of the d \u00d7 d identity\nmatrix, as the perturbation vectors. This is equivalent to retaining the \ufb01rst k major diagonal lines\nof the full RESCAL matrix and ignoring everything else in it. Empirically, we found that using\nIk\u00d7d substantially worsened performance. 
Our intuition is that such a choice is worse than using\nperturbation vectors that condense information from the entire RESCAL matrix, i.e., vectors with a\nwider footprint. We consider two such perturbation families, Haar and random 0/1 vectors.\nThe following example illustrates the intuition behind such wide-footprint vectors being a better\n\ufb01t for the task than Ik\u00d7d. Consider the tuple \u27e8Alice, review, paper42\u27e9. When we embed Alice\nand paper42 as entities, it may result in the i-th embedding dimension being indicative of a person\n(i.e., this dimension has a large value whenever the entity is a person) and the j-th dimension being\nindicative of an article. In this case, the (i, j)-th entry of the interaction matrix will have a large value\nfor the pair \u27e8Alice, paper42\u27e9, signaling that it \ufb01ts the relation \u201creview\u201d. If the (i, j)-th entry is far\naway from the main diagonal, it will be zeroed out (thus losing the information) when using Ik\u00d7d for\nperturbation, but captured by vectors with a wide footprint.\n\n3.1.1 Perturbation with Low Frequency Haar Vectors\nThe Haar wavelet system [6, 10] is widely used in signal processing. The 2 \u00d7 2 Haar matrix H2\nassociated with the Haar wavelet is shown on the left in Figure 2, which also shows H4. In general,\nthe 2n \u00d7 2n Haar matrix H2n can be derived from the n \u00d7 n Haar matrix Hn as shown on the right\nin Figure 2, where \u2297k represents the Kronecker product and I the identity matrix.\nHaar matrices have many desirable properties. Consider multiplying H4 with a vector a. We can\nsee that the inner product between the \ufb01rst row of H4 and a gives the sum of the entries of a (i.e.,\n\u2211_{i=0}^{3} ai). The inner product between the second row of H4 and a gives the difference between the\nsum of the \ufb01rst half of a and the second half (\u2211_{i=0}^{1} ai \u2212 \u2211_{i=2}^{3} ai). Importantly, we can infer the\nsum of each half of a by examining these two inner products. Generalizing this, consider the \ufb01rst 2^k\nrows of Hd, referred to as the (unnormalized) 2^k Haar wavelets with the lowest frequency. If we\nsplit vector a into 2^k segments of equal size, then one can infer the partial sum of each segment by\ncomputing the inner product of a with the \ufb01rst 2^k rows of Hd.\n\n3While we discuss expansion in the context of HOLE, it is evident from Eq. (9) that one can easily generalize\nthe notion (even if not the theoretical results that follow) to any embedding method that can be decomposed\nas \u03c3(s, r, o) = g(f (s, r), o), or a similar decomposition for another permutation of s, r, o. In this case, the\nexpanded version would simply be \u2211_{j=0}^{l\u22121} g(f (cj \u25e6 s, r), o).\n\n\fH2 = [1 1; 1 \u22121]    H4 = [1 1 1 1; 1 1 \u22121 \u22121; 1 \u22121 0 0; 0 0 1 \u22121]    H2n = [Hn \u2297k [1, 1]; In \u2297k [1, \u22121]]\n\nFigure 2: Haar matrices of order 2, 4, and 2n.\n\nThis view provides an intuitive interpretation of HOLEX when using perturbation with low frequency\nHaar vectors. In fact, we can prove that HOLEX using the \ufb01rst 2^k rows of Hd yields an embedding\nthat contains the partial sums of 2^k equal-sized segments along each (circular) diagonal line of the\ntensor product matrix. This is stated formally in Proposition 1. The case of using the \ufb01rst two rows of\nHd is visually depicted in the rightmost panel of Figure 1.\nProposition 1. Let 1 \u2264 k \u2264 K, d = 2^K, l = 2^k, Hl and Hd be Haar matrices of size l and\nd, respectively, and Hd,k be a matrix that contains the \ufb01rst l rows of Hd. h(a, b; H^T_{d,k}) is the\ncompositional operator for HOLEX using Hd,k as perturbation vectors. 
Let R be the full tensor\nproduct matrix arranged according to diagonal lines, i.e., Ri,j = ai b(i+j) mod d. De\ufb01ne:\n\nW = (1/l) H_l^T h(a, b; H^T_{d,k})\n\nThen, W captures the partial column sums of R. In other words, Wi,j is the sum of entries from\nR_{di/l, j} to R_{d(i+1)/l\u22121, j}, where all indices start from 0.\n\nProposition 1 formalizes how HOLEX forms an interpolation between HOLE and the full tensor\nproduct as an increasing number of Haar wavelets is used as perturbation vectors. While HOLE\ncaptures the sum along each full diagonal line, HOLEX gradually enriches the representation by\nadding subsequence sums as we include more and more rows from the Haar matrix.\n\n3.2 Projection with Random 0/1 Vectors\n\nWe next consider random perturbation vectors, each of whose entries is sampled independently and\nuniformly from {0, 1}. As suggested by Figure 1 (middle), HOLE perturbed with one such random\n0/1 vector is equivalent to randomly zeroing out roughly half the d rows (corresponding to the 0s in\nthe vector) from the tensor product matrix, before summing along each (circular) diagonal line.\n\nFigure 3: Sparse nature of full 64 \u00d7 64 RESCAL matrices learned from the FB15K dataset. The heat\nmap on the left shows a typical entity-relation matrix, i.e., a particular s \u2297 r. The plot on the right\nshows the average magnitudes of the entries in each (circular) diagonal line, normalized so that the\nlargest entry in each diagonal is 1, sorted in decreasing order, and averaged over the entire dataset.\n\nRandom vectors work particularly well if the full tensor product matrix is sparse, which turns out\nto often be the case. Figure 3 illustrates this sparsity for the FB15K dataset. The heat map on the\nleft highlights that there are relatively few large (dark) entries overall. The plot on the right shows\nthat each circular diagonal line, on average, is dominated by very few large entries. 
For example, on average, the 5th largest entry (out of 64) has a magnitude of only about half that of the largest. The\nvalues decay rapidly. This is in line with our expectation that different entries of the RESCAL matrix\ncarry different semantic information, not all of which is generally relevant for all entity-relation pairs.\nTo understand why random vectors are suitable for sparse interactions, consider the extreme but\nintuitive case where only one of the d entries in each diagonal line has large magnitude, and the rest\nare close to zero. For a particular diagonal line, one random vector zeros out a set of approximately\nd/2 entries, and the second random vector zeros out another set of d/2 entries, chosen randomly\nand independently of the \ufb01rst set. The number of entries that are not zeroed out by either of the two\nrandom vectors is thus approximately d/4. Continuing this reasoning, in expectation, only one entry\nwill \u201csurvive\u201d, i.e., remain not zeroed out, if one adds log2 d of such 0/1 random vectors.\nSuppose we apply HOLEX with 2 log d random vectors. For a particular diagonal line, approximately\nhalf (log d) of the random vectors will zero out the unique row of the large entry, thereby resulting in a\nsmall sum for that diagonal. Consider those log d random vectors that produce large sums. According\nto the previous reasoning, there is, in expectation, only one row that none of these vectors zeros out.\nThe intersection of this row and the diagonal line must, then, be the location of the large entry.\nTherefore, we have the following theorem, saying that if there is only one non-zero entry in every\ndiagonal, HOLEX can recover the whole matrix.\nTheorem 2. Suppose there is only one non-zero entry, of value 1, in each diagonal line of the\ntensor product matrix. Let \u03b7 > 0 and d be the embedding dimension. 
HOLEX expanded with\n\u23083 log d \u2212 log \u03b7\u2309 \u2212 1 random 0/1 vectors can locate the non-zero entry in each diagonal line of the\ntensor product matrix with probability at least 1 \u2212 \u03b7.\nAssuming exactly one non-zero entry per diagonal might be too strong, but it can be weakened using\ntechniques from compressed sensing, as re\ufb02ected in the following theorem:\nTheorem 3. Suppose each diagonal line of the tensor product matrix is s-sparse, i.e., has no\nmore than s non-zero entries. Let A \u2208 Rl\u00d7d be a random 0/1 matrix. Let \u03b7 \u2208 (0, 1) and l \u2265\nC(s log(d/s) + log(1/\u03b7)) for a universal constant C > 0. Then HOLEX with the rows of A as\nperturbation vectors can recover the tensor product matrix, i.e., identify all non-zero entries, with\nprobability at least 1 \u2212 \u03b7.\nThe proofs of the above two theorems are deferred to the Appendix. We note that Theorem 3 also\nholds in the noisy setting where diagonal lines have s large entries, but are corrupted by some bounded\nnoise vector e. In this case, we do not expect to fully recover the original tensor product matrix, but\ncan identify a matrix that is close enough, which is suf\ufb01cient for machine learning applications. We\nomit the details (cf. Theorem 2.7 of Rauhut [20]). Thus, HOLEX works provably as long as each\ndiagonal of the tensor product matrix can be approximated by a sparse vector.\n\n4 Experiments\n\nFor evaluation, we use the standard knowledge completion dataset FB15K [5]. This dataset is a subset\nof Freebase [3], which contains a large number of general facts about the world. FB15K contains\n14,951 entities, 1,345 relations, and 592,213 facts. The facts are divided into 483,142 for training,\n50,000 for validation, and 59,071 for testing.\nWe follow the evaluation methodology of prior work in this area. For each triple (s, r, o), we create a\nhead prediction query (?, r, o) and a tail prediction query (s, r, ?). 
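As a concrete sketch of the query-and-ranking protocol described next (a minimal illustration with hypothetical entity names and scores, not the paper's evaluation code):

```python
# Filtered ranking for one head-prediction query (?, cityIn, New York State).
# `scores` maps every candidate head to a model score; `known_heads` are
# other entities already known to satisfy the query in the training data.
def filtered_metrics(scores, true_head, known_heads, ks=(10, 5, 1)):
    filtered = {e: sc for e, sc in scores.items()
                if e == true_head or e not in known_heads}
    ranking = sorted(filtered, key=filtered.get, reverse=True)
    rank = ranking.index(true_head)          # 0-based rank
    rr = 1.0 / (rank + 1)                    # reciprocal rank
    hits = {k: int(rank < k) for k in ks}    # HITS@k
    return rank, rr, hits

scores = {"Albany": 0.9, "NYC": 0.95, "Buffalo": 0.2, "Troy": 0.1}
# "NYC" outscores the test head "Albany", but it is itself a known valid
# answer for cityIn, so it is filtered out and Albany ranks first.
rank, rr, hits = filtered_metrics(scores, "Albany", {"NYC"})
assert rank == 0 and rr == 1.0 and hits[10] == 1
```

These per-query numbers are then averaged over all head and tail queries, as described below.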
For head prediction (tail prediction\nis handled similarly), we use the knowledge completion method at hand to rank all entities based\non their predicted likelihood of being the correct head, resulting in an ordered list L. Since many\nrelations are not 1-to-1, there often are other (already known) valid facts of the form (s\u2032, r, o) with\ns\u2032 \u2260 s in the training data. To account for these equally valid answers, we follow prior work and \ufb01lter\nout such other valid heads from L to obtain L\u2032. Finally, for this query, we compute three metrics: the\n0-based rank r of s in this (\ufb01ltered) ordered list L\u2032, the reciprocal rank 1/(r + 1), and whether s appears\namong the top k items in L\u2032 (HITS@k, for k \u2208 {10, 5, 1}). The overall performance of the method\nis taken to be the average of these metrics across all head and tail prediction queries.\nWe reimplemented HOLE using the recent framework of Shi and Weninger [23], which is based\non TensorFlow [1] and is optimized for multiple CPUs. We consider both the original embedding\ndimension of 150, and a larger dimension of 256 that is better suited for our Haar vector based linear\n\n\fperturbation. In the notation of Shi and Weninger [23], we changed their interaction component\nbetween entity e and relation r from e \u2295 r = De e + Dr r + bc to the (expanded) holographic interaction\nh(e, r). We also dropped their non-linearity function, tanh, around this interaction for slightly better\nresults. Their other implementation choices were left intact, such as computing interaction between e\nand r rather than between two entities, using dropout, and other hyper-parameters.\n\n4.1 Impact of Varying the Number of Perturbation Vectors\n\nTo gain insight into HOLEX, we \ufb01rst consider the impact of adding an increasing number l of\nlinear perturbation vectors. We start with a small embedding dimension, 32, which allows for a full\ninterpolation between HOLE and RESCAL. 
We then report results with embedding dimension 256, with up to 8 random 0/1 vectors. Figure 4 depicts the resulting HITS@10 and Mean Rank metrics, as well as the training time per epoch (on a 32-CPU machine on Google Cloud Platform).

Figure 4: Impact of using a varying number l of random 0/1 vectors on the performance of HOLEX on FB15K. Left: HITS@10 and Mean Rank for a full interpolation between HOLE (l = 1) and RESCAL (l = 32), when the embedding dimension d = 32. Middle: Similar trend for d = 256. Right: Training time per epoch for experiments with d = 256.

On both small- and large-scale experiments, we observe that both the mean rank and HITS@10 metrics generally improve (ignoring random fluctuations) as l increases. On the large-scale experiment, even with l = 2, we already observe a substantial 2.5% improvement in HITS@10 (higher is better) and a reduction in mean rank (lower is better) by 8. In this particular case, mean rank saturates after l = 3, although HITS@10 continues to climb steadily until l = 8, suggesting further gains if even more perturbation vectors were used. The rightmost plot indicates that the training time scales roughly linearly with l, making l an effective knob for trading off test performance against training time.

4.2 Comparison with Existing Methods

We now compare HOLEX with several representative baselines. All baseline numbers, except for our reimplementation of HOLE, are taken from Shi and Weninger [23], who also report the performance of additional baselines, all of which fared worse than the TransR results reported here.

Remark 1. In private communication, Shi and Weninger noted that the published numbers for their method, ProjE, were inaccurate due to a bug (GitHub issue #3). We use their updated code from https://github.com/bxshi/ProjE and the new suggested parameters, reported here for completeness: max 50 iterations, learning rate 0.0005, and negative sampling weight 0.1.
We increased the embedding dimension from 200 to 256 for consistency with our method and reduced the batch size to 128, which improved the HITS@10 metric for the best variant, ProjE_listwise, from 80.0% to 82.9%. We use this final number here as the best ProjE baseline.

Table 1 summarizes our main results, with the various methods sorted by increasing HITS@10 performance. The best baseline numbers are highlighted in bold, and so are the best numbers using our expanded holographic embeddings method. We make a few observations.

First, although the RESCAL approach [18], which works with the full outer product matrix, is capable of capturing rich correlations by looking at every pair of dimensions, the resulting quadratically many parameters make it difficult to train in practice, eventually resulting in poor performance.

Second, models such as TransE [5] and TransR [16], which rely on simple vector arithmetic such as adding/subtracting vectors, are unable to capture rich correlations, again resulting in low performance.

Knowledge Completion Method                     Mean Rank  HITS@10 (%)  MRR    HITS@5 (%)  HITS@1 (%)

EXISTING METHODS
RESCAL [18]                                       683       44.1         -       -           -
TransE [5]                                        125       47.1         -       -           -
TransR [16]                                        77       68.7         -       -           -
TransE + Rev [15]                                  63       70.2         -       -           -
HOLE (original, dim=150) [19]                       -       73.9        0.524    -          40.2
HOLE (reimplementation, dim=150)                   70       78.4        0.588   72.0        47.7
ProjE_pointwise* (dim=256) [23]                    71       80.2        0.650   74.8        56.7
ProjE_wlistwise* (dim=256) [23]                    64       82.1        0.666   76.8        57.9
ProjE_listwise* (dim=256) [23]                     53       82.9        0.665   78.1        56.8
HOLE (reimplementation, dim=256)                   51       83.0        0.665   77.9        56.9
ComplEx [27]                                        -       84.0        0.692    -          59.9
PTransE (ADD, len-2 path) [15]                     54       83.4         -       -           -
PTransE (ADD, len-3 path) [15]                     58       84.6         -       -           -
DistMult [29], re-tuned by Kadlec et al. [13]      42       89.3        0.798    -           -

PROPOSED METHOD (dim=256)
HOLE (reimplemented baseline from above)           51       83.0        0.665   77.9        56.9
HOLEX, 8 Haar vectors                              51       86.7         -       -           -
HOLEX, 2 random 0/1 vectors                        48       85.4        0.720   81.4        64.0
HOLEX, 4 random 0/1 vectors                        47       87.1        0.763   83.9        69.8
HOLEX, 8 random 0/1 vectors                        47       87.9        0.786   85.0        73.1
HOLEX, 16 random 0/1 vectors                       49       88.6        0.800   86.0        75.0

Table 1: Expanded holographic embeddings, HOLEX, outperform a variety of knowledge completion methods on the FB15K dataset. Mean Rank (0-based) and HITS@10 are the main metrics we track; other metrics are reported for a more comprehensive comparison with prior work. Numbers are averages across head- and tail-prediction tasks, individual results for which may be found in the Appendix. * See Remark 1 for an explanation of ProjE results, and Footnote 4 for very recent models.

Third, reimplementing HOLE using the ProjE framework increases HITS@10 from 73.9% to 78.4%, likely due to improved training with the TensorFlow backend, regularization techniques like dropout, and entity-relation interaction rather than original HOLE's entity-entity interaction.
Further, simply increasing the embedding dimension from 150 to 256 allows HOLE to achieve 83.0% HITS@10, higher than most baseline methods that do not explicitly model KG paths, except for DistMult [29], which was re-tuned very carefully for this task [13] to achieve state-of-the-art results.⁴

Relative to the (reimplemented) HOLE baseline, our proposed HOLEX with 8 Haar vectors improves the HITS@10 metric by 3.7%. The use of random 0/1 vectors appears somewhat more effective, achieving 88.6% HITS@10 with 16 such vectors, a 5.7% improvement over ProjE, whose framework formed our codebase. This setting also achieves a mean reciprocal rank (MRR) of 0.800 and HITS@1 of 75.0%, matching or outperforming a wide variety of existing methods along various metrics.⁵

5 Conclusion

We proposed expanded holographic embeddings (HOLEX), a new family of embeddings for knowledge graphs that smoothly interpolates between the full product matrix of correlations on one hand and an effective lower-dimensional method, namely HOLE, on the other. By concatenating several linearly perturbed copies of HOLE, our approach allows the system to focus on different subspaces of the full embedding space, resulting in a richer representation. It recovers the full interaction matrix when sufficiently many copies are used. Empirical results on the standard FB15K dataset demonstrate the strength of HOLEX even with only a handful of perturbation vectors, and the benefit of being able to select a point that effectively trades off expressivity of relational embeddings with computation.

⁴A recent model called EKGN [22] outperforms this with HITS@10 at 92.7% and mean rank 38.
Explicitly modeling reciprocal relations, adding a new regularizer, and using a weighted nuclear 3-norm have also recently been shown to improve both CP decomposition and ComplEx to HITS@10 of 91% and MRR 0.86, albeit with much larger embedding dimensions of 4,000 and 2,000 for CP and ComplEx, respectively [14, 21].

⁵Our HOLEX implementation uses the hyper-parameters recommended for ProjE, except for embedding dimension 256 and batch size 128. Hyper-parameter tuning targeted at HOLEX should improve results further.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2015.

[2] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia: A crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[3] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250. ACM, 2008.

[4] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, 2012.

[5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.

[6] Charles K. Chui, Jeffrey M. Lemm, and Sahra Sedigh. An Introduction to Wavelets. Academic Press, 1992.

[7] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.

[8] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin P. Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang.
Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In KDD, pages 601–610, 2014.

[9] Kelvin Guu, John G. Miller, and Percy Liang. Traversing knowledge graphs in vector space. In EMNLP, 2015.

[10] Alfred Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen. Springer-Verlag, 1910.

[11] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Edwin Lewis-Kelham, Gerard de Melo, and Gerhard Weikum. YAGO2: Exploring and querying world knowledge in time, space, context, and many languages. In WWW, pages 229–232. ACM, 2011.

[12] Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.

[13] Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. In Rep4NLP Workshop at ACL, 2017.

[14] Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In ICML, 2018.

[15] Yankai Lin, Zhiyuan Liu, Huan-Bo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. In EMNLP, 2015.

[16] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2015.

[17] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[18] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, pages 809–816, 2011.

[19] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, pages 1955–1961, 2016.

[20] Holger Rauhut. Compressive sensing and structured random matrices.
Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92, 2010.

[21] Farnood Salehi, Robert Bamler, and Stephan Mandt. Probabilistic knowledge graph embeddings. In NeurIPS-2018 Symposium on Advances in Approximate Bayesian Inference, 2018.

[22] Yelong Shen, Po-Sen Huang, Ming-Wei Chang, and Jianfeng Gao. Link prediction using embedded knowledge graphs. CoRR, abs/1611.04642v5, 2018.

[23] Baoxu Shi and Tim Weninger. ProjE: Embedding projection for knowledge graph completion. In AAAI, pages 1236–1242, 2017.

[24] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 2012.

[25] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.

[26] Niket Tandon, Gerard de Melo, Fabian M. Suchanek, and Gerhard Weikum. WebChild: Harvesting and organizing commonsense knowledge from the web. In WSDM, 2014.

[27] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071–2080, 2016.

[28] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.

[29] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases.
In ICLR, 2015.