{"title": "Reasoning With Neural Tensor Networks for Knowledge Base Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 926, "page_last": 934, "abstract": "A common problem in knowledge representation and related fields is reasoning over a large joint knowledge graph, represented as triples of a relation between two entities. The goal of this paper is to develop a more powerful neural network model suitable for inference over these relationships. Previous models suffer from weak interaction between entities or simple linear projection of the vector space. We address these problems by introducing a neural tensor network (NTN) model which allow the entities and relations to interact multiplicatively. Additionally, we observe that such knowledge base models can be further improved by representing each entity as the average of vectors for the words in the entity name, giving an additional dimension of similarity by which entities can share statistical strength. We assess the model by considering the problem of predicting additional true relations between entities given a partial knowledge base. Our model outperforms previous models and can classify unseen relationships in WordNet and FreeBase with an accuracy of 86.2% and 90.0%, respectively.", "full_text": "Reasoning With Neural Tensor Networks\n\nfor Knowledge Base Completion\n\nRichard Socher\u2217, Danqi Chen*, Christopher D. Manning, Andrew Y. 
Ng\n\nComputer Science Department, Stanford University, Stanford, CA 94305, USA\n\nrichard@socher.org, {danqi,manning}@stanford.edu, ang@cs.stanford.edu\n\nAbstract\n\nKnowledge bases are an important resource for question answering and other tasks\nbut often suffer from incompleteness and lack of ability to reason over their dis-\ncrete entities and relationships.\nIn this paper we introduce an expressive neu-\nral tensor network suitable for reasoning over relationships between two entities.\nPrevious work represented entities as either discrete atomic units or with a single\nentity vector representation. We show that performance can be improved when en-\ntities are represented as an average of their constituting word vectors. This allows\nsharing of statistical strength between, for instance, facts involving the \u201cSumatran\ntiger\u201d and \u201cBengal tiger.\u201d Lastly, we demonstrate that all models improve when\nthese word vectors are initialized with vectors learned from unsupervised large\ncorpora. We assess the model by considering the problem of predicting additional\ntrue relations between entities given a subset of the knowledge base. Our model\noutperforms previous models and can classify unseen relationships in WordNet\nand FreeBase with an accuracy of 86.2% and 90.0%, respectively.\n\n1\n\nIntroduction\n\nOntologies and knowledge bases such as WordNet [1], Yago [2] or the Google Knowledge Graph are\nextremely useful resources for query expansion [3], coreference resolution [4], question answering\n(Siri), information retrieval or providing structured knowledge to users. However, they suffer from\nincompleteness and a lack of reasoning capability.\nMuch work has focused on extending existing knowledge bases using patterns or classi\ufb01ers applied\nto large text corpora. However, not all common knowledge that is obvious to people is expressed in\ntext [5, 6, 2, 7]. 
We adopt here the complementary goal of predicting the likely truth of additional facts based on existing facts in the knowledge base. Such factual, common sense reasoning is available and useful to people. For instance, when told that a new species of monkeys has been discovered, a person does not need to find textual evidence to know that this new monkey, too, will have legs (a meronymic relationship inferred due to a hyponymic relation to monkeys in general).\nWe introduce a model that can accurately predict additional true facts using only an existing database. This is achieved by representing each entity (i.e., each object or individual) in the database as a vector. These vectors can capture facts about that entity and how probable it is that it participates in a certain relation. Each relation is defined through the parameters of a novel neural tensor network which can explicitly relate two entity vectors. The first contribution of this paper is the new neural tensor network (NTN), which generalizes several previous neural network models and provides a more powerful way to model relational information than a standard neural network layer.\nThe second contribution is to introduce a new way to represent entities in knowledge bases. Previous work [8, 9, 10] represents each entity with one vector. However, this does not allow the sharing of statistical strength if entity names share similar substrings. \n\n\u2217Both authors contributed equally.\n\n\fFigure 1: Overview of our model which learns vector representations for entries in a knowledge base in order to predict new relationship triples. If combined with word representations, the relationships can be predicted with higher accuracy and for entities that were not in the original knowledge base.\n\n
Instead, we represent each entity as the average of its word vectors, allowing the sharing of statistical strength between the words describing each entity, e.g., Bank of China and China.\nThe third contribution is the incorporation of word vectors which are trained on large unlabeled text. This readily available resource enables all models to more accurately predict relationships.\nWe train on relationships in WordNet and Freebase and evaluate on a heldout set of unseen relational triplets. Our new model, illustrated in Fig. 1, outperforms previously introduced related models such as those of [8, 9, 10] by a large margin. We will make the code and dataset available at www.socher.org.\n\n2 Related Work\n\nThe work most similar to ours is that by Bordes et al. [8] and Jenatton et al. [9] who also learn vector representations for entries in a knowledge base. We implement their approach and compare to it directly. Our new model outperforms this and other previous work. We also show that both our and their model can benefit from initialization with unsupervised word vectors.\nAnother related approach is by Sutskever et al. [11] who use tensor factorization and Bayesian clustering for learning relational structures. Instead of clustering the entities in a nonparametric Bayesian framework, we rely purely on learned entity vectors. Their computation of the truth of a relation can be seen as a special case of our proposed model. Instead of using MCMC for inference and learning, we use standard forward propagation and backpropagation techniques modified for the NTN. Lastly, we do not require multiple embeddings for each entity. Instead, we consider the subunits (space-separated words) of entity names.\nOur Neural Tensor Network is related to other models in the deep learning literature. 
Ranzato and Hinton [12] introduced a factored 3-way Restricted Boltzmann Machine which is also parameterized by a tensor. Recently, Yu et al. [13] introduced a model with tensor layers for speech recognition. Their model is a special case of our model and is only applicable inside deeper neural networks. Simultaneously with this paper, we developed a recursive version of this model for sentiment analysis [14].\nThere is a vast amount of work on extending knowledge bases by parsing external text corpora [5, 6, 2], among many others. The field of open information extraction [15], for instance, extracts relationships from millions of web pages. This work is complementary to ours; we mainly note that little work has been done on knowledge base extension based purely on the knowledge base itself or with readily available resources but without re-parsing a large corpus.\n\n[Figure 1 appears here: a knowledge base with relations such as has part, type of and instance of feeds a Neural Tensor Network that scores the triplet (Bengal tiger, has part, tail), with entities embedded in a shared word vector space.]\n\nLastly, our model can be seen as learning a tensor factorization, similar to Nickel et al. [16]. In the comparison of Bordes et al. [17] these factorization methods have been outperformed by energy-based models.\nMany methods that use knowledge bases as features such as [3, 4] could benefit from a method that maps the provided information into vector representations. We learn to modify word representations via grounding in world knowledge. This essentially allows us to analyze word embeddings and query them for specific relations. 
Furthermore, the resulting vectors could be used in other tasks such as named entity recognition [18] or relation classification in natural language [19].\n\n3 Neural Models for Reasoning over Relations\n\nThis section introduces the neural tensor network that reasons over database entries by learning vector representations for them. As shown in Fig. 1, each relation triple is described by a neural network and pairs of database entities which are given as input to that relation's model. The model returns a high score if they are in that relationship and a low one otherwise. This allows any fact, whether implicitly or explicitly mentioned in the database, to be answered with a certainty score. We first describe our neural tensor model and then show that many previous models are special cases of it.\n\n3.1 Neural Tensor Networks for Relation Classification\n\nThe goal is to learn models for common sense reasoning, the ability to realize that some facts hold purely due to other existing relations. Another way to describe the goal is link prediction in an existing network of relationships between entity nodes. The goal of our approach is to be able to state whether two entities (e1, e2) are in a certain relationship R. For instance, whether the relationship (e1, R, e2) = (Bengal tiger, has part, tail) is true and with what certainty. To this end, we define a set of parameters indexed by R for each relation's scoring function. Let e1, e2 \u2208 R^d be the vector representations (or features) of the two entities. For now we can assume that each value of this vector is initialized to a small uniformly random number.\nThe Neural Tensor Network (NTN) replaces a standard linear neural network layer with a bilinear tensor layer that directly relates the two entity vectors across multiple dimensions. 
The model computes a score of how likely it is that two entities are in a certain relationship by the following NTN-based function:\n\ng(e1, R, e2) = u_R^T f( e1^T W_R^[1:k] e2 + V_R [e1; e2] + b_R ),    (1)\n\nwhere f = tanh is a standard nonlinearity applied element-wise, [e1; e2] denotes the concatenation of the two entity vectors, and W_R^[1:k] \u2208 R^{d\u00d7d\u00d7k} is a tensor. The bilinear tensor product e1^T W_R^[1:k] e2 results in a vector h \u2208 R^k, where each entry is computed by one slice i = 1, . . . , k of the tensor: h_i = e1^T W_R^[i] e2. The other parameters for relation R are the standard form of a neural network: V_R \u2208 R^{k\u00d72d}, u_R \u2208 R^k and b_R \u2208 R^k.\nFig. 2 shows a visualization of this model. The main advantage is that it can relate the two inputs multiplicatively instead of only implicitly through the nonlinearity as with standard neural networks where the entity vectors are simply concatenated. Intuitively, we can see each slice of the tensor as being responsible for one type of entity pair or instantiation of a relation. For instance, the model could learn that both animals and mechanical entities such as cars can have parts (i.e., (car, has part, x)) from different parts of the semantic word vector space. In our experiments, we show that this results in improved performance. Another way to interpret each tensor slice is that it mediates the relationship between the two entity vectors differently.\n\nFigure 2: Visualization of the Neural Tensor Network. Each dashed box represents one slice of the tensor; in this case there are k = 2 slices.\n\n3.2 Related Models and Special Cases\n\nWe now introduce several related models in increasing order of expressiveness and complexity. 
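For illustration, the scoring function of Eq. 1 can be sketched in a few lines of numpy. This is a minimal reimplementation for exposition, not the released code; all variable names are ours:

```python
import numpy as np

def ntn_score(e1, e2, W, V, u, b):
    """Sketch of Eq. 1: g(e1, R, e2) = u^T tanh(e1^T W^[1:k] e2 + V [e1; e2] + b).

    e1, e2 : (d,) entity vectors
    W      : (k, d, d) tensor, one d x d slice per hidden unit
    V      : (k, 2d) standard-layer weights
    u, b   : (k,) output weights and bias
    """
    # bilinear tensor product: h[i] = e1^T W[i] e2 for each slice i
    h = np.einsum('i,kij,j->k', e1, W, e2)
    # add the standard single-layer term on the concatenated entity vectors
    z = h + V @ np.concatenate([e1, e2]) + b
    return u @ np.tanh(z)
```

Each of the k tensor slices contributes one hidden unit, so increasing k adds per-relation capacity.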
Each model assigns a score to a triplet using a function g measuring how likely the triplet is correct. The ideas and strengths of these models are combined in our new Neural Tensor Network defined above.\nDistance Model. The model of Bordes et al. [8] scores relationships by mapping the left and right entities to a common space using a relationship-specific mapping matrix and measuring the L1 distance between the two. The scoring function for each triplet has the following form:\n\ng(e1, R, e2) = ||W_{R,1} e1 \u2212 W_{R,2} e2||_1,\n\nwhere W_{R,1}, W_{R,2} \u2208 R^{d\u00d7d} are the parameters of relation R's classifier. This similarity-based model scores correct triplets lower (entities most certainly in a relation have 0 distance). All other functions are trained to score correct triplets higher. The main problem with this model is that the parameters of the two entity vectors do not interact with each other; they are independently mapped to a common space.\nSingle Layer Model. The second model tries to alleviate the problems of the distance model by connecting the entity vectors implicitly through the nonlinearity of a standard, single layer neural network. The scoring function has the following form:\n\ng(e1, R, e2) = u_R^T f( W_{R,1} e1 + W_{R,2} e2 ) = u_R^T f( [W_{R,1} W_{R,2}] [e1; e2] ),\n\nwhere f = tanh, W_{R,1}, W_{R,2} \u2208 R^{k\u00d7d} and u_R \u2208 R^{k\u00d71} are the parameters of relation R's scoring function. While this is an improvement over the distance model, the non-linearity only provides a weak interaction between the two entity vectors at the expense of a harder optimization problem. Collobert and Weston [20] trained a similar model to learn word vector representations using words in their context. This model is a special case of the tensor neural network if the tensor is set to 0.\nHadamard Model. This model was introduced by Bordes et al. 
[10] and tackles the issue of weak entity vector interaction through multiple matrix products followed by Hadamard products. It is different from the other models in our comparison in that it represents each relation simply as a single vector that interacts with the entity vectors through several linear products, all of which are parameterized by the same parameters. The scoring function is as follows:\n\ng(e1, R, e2) = (W_1 e1 \u2297 W_{rel,1} e_R + b_1)^T (W_2 e2 \u2297 W_{rel,2} e_R + b_2),\n\nwhere W_1, W_{rel,1}, W_2, W_{rel,2} \u2208 R^{d\u00d7d} and b_1, b_2 \u2208 R^{d\u00d71} are parameters that are shared by all relations. The only relation-specific parameter is e_R. While this allows the model to treat relational words and entity words the same way, we show in our experiments that giving each relationship its own matrix operators results in improved performance. However, the bilinear form between entity vectors is by itself desirable.\nBilinear Model. The fourth model [11, 9] fixes the issue of weak entity vector interaction through a relation-specific bilinear form. The scoring function is as follows: g(e1, R, e2) = e1^T W_R e2, where W_R \u2208 R^{d\u00d7d} are the only parameters of relation R's scoring function. This is a big improvement over the two previous models as it incorporates the interaction of two entity vectors in a simple and efficient way. However, the model is now restricted in terms of expressive power and number of parameters by the word vectors. The bilinear form can only model linear interactions and is not able to fit more complex scoring functions. This model is a special case of NTNs with V_R = 0, b_R = 0, k = 1, f = identity. In comparison to bilinear models, the neural tensor has much more expressive power which will be useful especially for larger databases. 
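These reductions are easy to verify numerically. The following sketch (our own notation, with a pluggable nonlinearity f) checks that the bilinear model is the general tensor form with V_R = 0, b_R = 0, k = 1 and f = identity:

```python
import numpy as np

def general_score(e1, e2, W, V, u, b, f=np.tanh):
    """u^T f(e1^T W^[1:k] e2 + V [e1; e2] + b) with a pluggable nonlinearity f."""
    h = np.einsum('i,kij,j->k', e1, W, e2)
    return u @ f(h + V @ np.concatenate([e1, e2]) + b)

d = 5
rng = np.random.default_rng(1)
e1, e2 = rng.normal(size=d), rng.normal(size=d)
WR = rng.normal(size=(d, d))  # the bilinear model's only relation parameter

# the general form with a single slice, V = 0, b = 0 and identity f ...
g_special = general_score(e1, e2, WR[None], np.zeros((1, 2 * d)),
                          np.ones(1), np.zeros(1), f=lambda x: x)
# ... equals the bilinear score e1^T W_R e2
assert np.isclose(g_special, e1 @ WR @ e2)
```

Setting the tensor W to zero instead recovers the single layer model, as noted above.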
For smaller datasets the number of slices could be reduced or even vary between relations.\n\n3.3 Training Objective and Derivatives\n\nAll models are trained with contrastive max-margin objective functions. The main idea is that each triplet in the training set T^(i) = (e1^(i), R^(i), e2^(i)) should receive a higher score than a triplet in which one of the entities is replaced with a random entity. There are N_R many relations, indexed by R^(i) for each triplet. Each relation has its associated neural tensor net parameters. We call the triplet with a random entity corrupted and denote the corrupted triplet as T_c^(i) = (e1^(i), R^(i), e_c), where we sampled entity e_c randomly from the set of all entities that can appear at that position in that relation. Let the set of all relationships' NTN parameters be \u2126 = {u, W, V, b, E}. We minimize the following objective:\n\nJ(\u2126) = \u2211_{i=1}^{N} \u2211_{c=1}^{C} max( 0, 1 \u2212 g(T^(i)) + g(T_c^(i)) ) + \u03bb ||\u2126||_2^2,\n\nwhere N is the number of training triplets and we score the correct relation triplet higher than its corrupted one up to a margin of 1. For each correct triplet we sample C random corrupted triplets. We use standard L2 regularization of all the parameters, weighted by the hyperparameter \u03bb.\nThe model is trained by taking derivatives with respect to the five groups of parameters. The derivatives for the standard neural network weights V are the same as in general backpropagation. Dropping the relation-specific index R, we have the following derivative for the j'th slice of the full tensor:\n\n\u2202g(e1, R, e2) / \u2202W^[j] = u_j f'(z_j) e1 e2^T,  where z_j = e1^T W^[j] e2 + V_j\u00b7 [e1; e2] + b_j,\n\nwhere V_j\u00b7 is the j'th row of the matrix V and we defined z_j as the j'th element of the k-dimensional hidden tensor layer. We use minibatched L-BFGS for optimization which converges to a local optimum of our non-convex objective function. We also experimented with AdaGrad but found that it performed slightly worse.\n\n3.4 Entity Representations Revisited\n\nAll the above models work well with randomly initialized entity vectors. In this section we introduce two further improvements: representing entities by their word vectors and initializing word vectors with pre-trained vectors.\nPrevious work [8, 9, 10] assigned a single vector representation to each entity of the knowledge base, which does not allow the sharing of statistical strength between the words describing each entity. Instead, we model each word as a d-dimensional vector \u2208 R^d and compute an entity vector as the composition of its word vectors. 
For instance, if the training data includes a fact that homo sapiens is a type of hominid and this entity is represented by two vectors v_homo and v_sapiens, we may extend the fact to the previously unseen homo erectus, even though its second word vector for erectus might still be close to its random initialization.\nHence, for a total number of N_E entities consisting of N_W many unique words, if we train on the word level (the training error derivatives are also back-propagated to these word vectors) and represent entities by word vectors, the full embedding matrix has dimensionality E \u2208 R^{d\u00d7N_W}. Otherwise we represent each entity as an atomic single vector and train the entity embedding matrix E \u2208 R^{d\u00d7N_E}.\nWe represent the entity vector by averaging its word vectors. For example, v_{homo sapiens} = 0.5(v_homo + v_sapiens). We have also experimented with Recursive Neural Networks (RNNs) [21, 19] for the composition. In the WordNet subset over 60% of the entities have only a single word and over 90% have two or fewer words. Furthermore, most of the entities do not exhibit a clear compositional structure, e.g., people names in Freebase. Hence, RNNs did not show any distinct improvement over simple averaging and we will not include them in the experimental results.\nTraining word vectors has the additional advantage that we can benefit from pre-trained unsupervised word vectors, which in general capture some distributional syntactic and semantic information. We will analyze how much it helps to use these vectors for initialization in Sec. 4.2. Unless otherwise specified, we use the d = 100-dimensional vectors provided by [18]. Note that our approach does not explicitly deal with polysemous words. One possible future extension is to incorporate the idea of multiple word vectors per word as in Huang et al. 
[22].\n\n4 Experiments\n\nExperiments are conducted on both WordNet [1] and FreeBase [23] to predict whether some relations hold using other facts in the database. This can be seen as common sense reasoning [24] over known facts or link prediction in relationship networks. For instance, if somebody was born in London, then their nationality would be British. If a German Shepherd is a dog, it is also a vertebrate. Our models can obtain such knowledge (with varying degrees of accuracy) by jointly learning relationship classifiers and entity representations.\nWe first describe the datasets, then compare the above models and conclude with several analyses of important modeling decisions, such as whether to use entity vectors or word vectors.\n\n4.1 Datasets\n\nDataset  | #R. | # Ent. | # Train | # Dev | # Test\nWordNet  | 11  | 38,696 | 112,581 | 2,609 | 10,544\nFreebase | 13  | 75,043 | 316,232 | 5,908 | 23,733\n\nTable 1: The statistics for WordNet and Freebase including the number of different relations #R.\n\nTable 1 gives the statistics of the databases. For WordNet we use 112,581 relational triplets for training. In total, there are 38,696 unique entities in 11 different relations. One important difference to previous work is our dataset generation which filters trivial test triplets. We filter out tuples from the testing set if either or both of their two entities also appear in the training set in a different relation or order. For instance, if (e1, similar to, e2) appears in the training set, we delete (e2, similar to, e1) and (e1, type of, e2), etc. from the testing set. In the case of synsets containing multiple words, we pick the first, most frequent one. For FreeBase, we use the relational triplets from the People domain, and extract 13 relations. 
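The filtering of trivial test triplets can be sketched as follows. This is a simplified rendering of the rule just described, not the exact released pipeline:

```python
def filter_trivial(test_triplets, train_triplets):
    """Drop test triplets whose entity pair (in either order) already
    occurs in the training set, e.g. under another relation."""
    seen = set()
    for e1, _, e2 in train_triplets:
        seen.add((e1, e2))
        seen.add((e2, e1))
    return [t for t in test_triplets if (t[0], t[2]) not in seen]

train = [("e1", "similar to", "e2")]
test = [("e2", "similar to", "e1"),   # reversed order: filtered
        ("e1", "type of", "e2"),      # different relation: filtered
        ("e3", "type of", "e4")]      # unseen pair: kept
assert filter_trivial(test, train) == [("e3", "type of", "e4")]
```

Without this step, a model could score well by simply memorizing which entity pairs co-occur.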
We remove 6 of them (place of death, place of birth, location, parents, children, spouse) from the testing set since they are very difficult to predict, e.g., the name of somebody's spouse is hard to infer from other knowledge in the database.\nIt is worth noting that the setting of FreeBase is profoundly different from WordNet's. In WordNet, e1 and e2 can be arbitrary entities; but in FreeBase, e1 is restricted to be a person's name, and e2 can only be chosen from a finite answer set. For example, if R = gender, e2 can only be male or female; if R = nationality, e2 can only be one of 188 country names. All the relations for testing and their answer set sizes are shown in Fig. 3.\nWe use a different evaluation set from [8] because it has become apparent to us and them that there were issues of overlap between their training and testing sets which impacted the quality and interpretability of their evaluation.\n\n4.2 Relation Triplets Classification\n\nOur goal is to predict correct facts in the form of relations (e1, R, e2) in the testing data. This could be seen as answering questions such as Does a dog have a tail?, using the scores g(dog, has part, tail) computed by the various models.\nWe use the development set to find a threshold T_R for each relation such that if g(e1, R, e2) \u2265 T_R, the relation (e1, R, e2) holds, and otherwise it does not.\nIn order to create a testing set for classification, we randomly switch entities from correct testing triplets, resulting in a total of 2 \u00d7 #Test triplets with an equal number of positive and negative examples. In particular, we constrain the entities from the possible answer set for Freebase by only allowing entities in a position if they appeared in that position in the dataset. For example, given a correct triplet (Pablo Picasso, nationality, Spain), a potential negative example is (Pablo Picasso, nationality, United States). 
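Per-relation threshold selection on the development set can be sketched as a search over candidate cutoffs. This is our own minimal version for illustration:

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the threshold T_R that maximizes dev accuracy for one relation,
    predicting that (e1, R, e2) holds whenever g(e1, R, e2) >= T_R."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):  # every observed score is a candidate cutoff
        acc = np.mean((scores >= t) == labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# dev scores for true (1) and corrupted (0) triplets of one relation
assert best_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) == 0.8
```

At test time the same threshold T_R is applied to the held-out scores of that relation.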
We generate the development set in the same way. This forces the model to focus on harder cases and makes the evaluation harder since it does not include obvious non-relations such as (Pablo Picasso, nationality, Van Gogh). The final accuracy is based on how many triplets are classified correctly.\n\nModel Comparisons\n\nWe first compare the accuracy among different models. In order to get the highest accuracy for all the models, we cross-validate using the development set to find the best hyperparameters: (i) vector initialization (see next section); (ii) regularization parameter \u03bb = 0.0001; (iii) the dimensionality of the hidden vector (for the single layer and NTN models d = 100) and (iv) number of training iterations T = 500. Finally, the number of slices was set to 4 in our NTN model.\nTable 2 shows the resulting accuracy of each model. Our Neural Tensor Network achieves an accuracy of 86.2% on the WordNet dataset and 90.0% on Freebase, which is at least 2% higher than the bilinear model and 4% higher than the Single Layer Model.\n\nModel                 | WordNet | Freebase | Avg.\nDistance Model        | 68.3    | 61.0     | 64.7\nHadamard Model        | 80.0    | 68.8     | 74.4\nSingle Layer Model    | 76.0    | 85.3     | 80.7\nBilinear Model        | 84.1    | 87.7     | 85.9\nNeural Tensor Network | 86.2    | 90.0     | 88.1\n\nTable 2: Comparison of accuracy of the different models described in Sec. 3.2 on both datasets.\n\nFigure 3: Comparison of accuracy of different relations on both datasets. For FreeBase, the number in the bracket denotes the size of the possible answer set.\n\nFirst, we compare the accuracy among different relation types. Fig. 3 reports the accuracy of each relation on both datasets. Here we use our NTN model for evaluation; other models generally have a lower accuracy and a similar distribution among different relations. 
The accuracy reflects the difficulty of inferring a relationship from the knowledge base.\nOn WordNet, the accuracy varies from 75.5% (domain region) to 97.5% (subordinate instance of). Reasoning about some relations is more difficult than others; for instance, the relation (dramatic art, domain region, closed circuit television) is much more vague than the relation (missouri, subordinate instance of, river). Similarly, the accuracy varies from 77.2% (institution) to 96.6% (gender) in FreeBase. We can see that the two easiest relations for reasoning are gender and nationality, and the two most difficult ones are institution and cause of death. Intuitively, we can infer the gender and nationality from the name, location, or profession of a person, but we can hardly infer a person's cause of death from all the other information.\nWe now analyze the choice of entity representations and also the influence of word initializations. As explained in Sec. 3.4, we compare training entity vectors (E \u2208 R^{d\u00d7N_E}) and training word vectors (E \u2208 R^{d\u00d7N_W}), where an entity vector is computed as the average of word vectors. Furthermore, we compare random initialization and unsupervised initialization for training word vectors. In summary, we explore three options: (i) entity vectors (EV); (ii) randomly initialized word vectors (WV); (iii) word vectors initialized with unsupervised word vectors (WV-init).\nFig. 4 shows the various models and their performance with these three settings. We observe that word vectors consistently and significantly outperform entity vectors on WordNet, and this also holds in most cases on FreeBase. It might be because the entities in WordNet share more common words. Furthermore, we can see that most of the models have improved accuracy with initialization from unsupervised word vectors. 
Even with random initialization, our NTN model with training word vectors can reach high classification accuracy: 84.7% and 88.9% on WordNet and Freebase respectively. In other words, our model is still able to perform good reasoning without external textual resources.\n\n4.3 Examples of Reasoning\n\nWe have shown that our model can achieve high accuracy when predicting whether a relational triplet is true or not. In this section, we give some example predictions. In particular, we are interested in how the model does transitive reasoning across multiple relationships in the knowledge base.\nFirst, we demonstrate examples of relationship predictions by our Neural Tensor Network on WordNet. We select the first entity and a relation and then sort all the entities (represented by their word vector averages) by descending scores that the model assigns to the complete triplet.\n\n[Figure 3 appears here: bar charts of per-relation accuracy. WordNet relations: has instance, type of, member meronym, member holonym, part of, has part, subordinate instance of, domain region, synset domain topic, similar to, domain topic. FreeBase relations: gender (2), nationality (188), profession (455), institution (727), cause of death (170), religion (107), ethnicity (211).]\n\nFigure 4: Influence of entity representations. EV: entity vectors. WV: randomly initialized word vectors. WV-init: word vectors initialized with unsupervised semantic word vectors.\n\nEntity e1    | Relationship R          | Sorted list of entities likely to be in this relationship\ntube         | type of                 | structure; anatomical structure; device; body; body part; organ\ncreator      | type of                 | individual; adult; worker; man; communicator; instrumentalist\ndubrovnik    | subordinate instance of | city; town; city district; port; river; region; island\narmed forces | domain region           | military operation; naval forces; military officer; military court\nboldness     | has instance            | audaciousness; aggro; abductor; interloper; confession\npeople       | type of                 | group; agency; social group; organisation; alphabet; race\n\nTable 3: Examples of a ranking by the model for right hand side entities in WordNet. The ranking is based on the scores that the neural tensor network assigns to each triplet.\n\nTable 3 shows some examples for several relations, and most of the inferred relations among them are plausible.\nFig. 5 illustrates a real example from FreeBase in which a person's information is inferred from the other relations provided in training. Given that the place of birth is Florence and the profession is historian, our model can accurately predict that Francesco Guicciardini's gender is male and his nationality is Italy. These might be inferred from two pieces of common knowledge: (i) Florence is a city of Italy; (ii) Francesco is a common name among males in Italy. The key is how our model can derive these facts from the knowledge base itself, without the help of external information. For the first fact, some relations such as Matteo Rosselli has location Florence and nationality Italy exist in the knowledge base, which might imply the connection between Florence and Italy. For the second fact, we can see that many other people, e.g., Francesco Patrizi, are shown as Italian or male in FreeBase, which might imply that Francesco is a male or Italian name. It is worth noting that we do not have an explicit relation between Francesco Guicciardini and Francesco Patrizi; the dashed line in Fig. 5 shows the benefits from the sharing via word representations.\n\nFigure 5: A reasoning example in FreeBase. 
Black lines denote relationships given in training, red lines denote relationships the model inferred. The dashed line denotes word vector sharing.

5 Conclusion

We introduced Neural Tensor Networks for knowledge base completion. Unlike previous models for predicting relationships using entities in knowledge bases, our model allows mediated interaction of entity vectors via a tensor. The model obtains the highest accuracy in terms of predicting unseen relationships between entities through reasoning inside a given knowledge base. It enables the extension of databases even without external textual resources. We further show that by representing entities through their constituent words and initializing these word representations using readily available word vectors, the performance of all models improves substantially. Potential paths for future work include scaling the number of slices based on available training data for each relation and extending these ideas to reasoning over free text.

[Figure 4 panels: WordNet and FreeBase accuracy (%) for the Distance, Hadamard, Single Layer, Bilinear, and NTN models under the EV, WV, and WV-init representations.]

[Figure 5 diagram: Francesco Guicciardini with profession historian, gender male, nationality Italy, place of birth Florence; related entities Matteo Rosselli (location Florence, nationality Italy) and Francesco Patrizi (nationality Italy, gender male).]

Acknowledgments
Richard is partly supported by a Microsoft Research PhD fellowship. The authors gratefully acknowledge the support of a Natural Language Understanding-focused gift from Google Inc., the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-13-2-0040, the DARPA Deep Learning program under contract number FA8650-10-C-7020 and NSF IIS-1159679.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References
[1] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
[2] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, 2007.
[3] J. Graupmann, R. Schenkel, and G. Weikum. The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB, 2005.
[4] V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution. In ACL, 2002.
[5] R. Snow, D. Jurafsky, and A. Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2005.
[6] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[7] G. Angeli and C. D. Manning. Philosophers are mortal: Inferring the truth of unseen facts. In CoNLL, 2013.
[8] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[9] R. Jenatton, N. Le Roux, A. Bordes, and G. Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.
[10] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, 2012.
[11] I. Sutskever, R. Salakhutdinov, and J. B. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In NIPS, 2009.
[12] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
[13] D. Yu, L. Deng, and F. Seide.
Large vocabulary speech recognition using deep tensor neural networks. In INTERSPEECH, 2012.
[14] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
[15] A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, and S. Soderland. TextRunner: Open information extraction on the web. In HLT-NAACL (Demonstrations), 2007.
[16] M. Nickel, V. Tresp, and H. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, 2011.
[17] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko. Irreflexive and hierarchical relations as translations. CoRR, abs/1304.7158, 2013.
[18] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384-394, 2010.
[19] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP, 2012.
[20] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
[21] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, 2011.
[22] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations via global context and multiple word prototypes. In ACL, 2012.
[23] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.
[24] N. Tandon, G. de Melo, and G. Weikum. Deriving a web-scale commonsense fact database.
In AAAI Conference on Artificial Intelligence (AAAI 2011), 2011.
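To make the ranking procedure of Section 4.3 concrete, the sketch below scores triplets with an NTN-style function g(e1, R, e2) = u^T tanh(e1^T W^[1:k] e2 + V [e1; e2] + b), represents entities as averages of their word vectors, and sorts candidate right-hand-side entities by descending score, as in Table 3. This is an illustrative sketch, not the authors' implementation: the dimensions, toy vocabulary, and random (untrained) parameters are assumptions.

```python
import math
import random

random.seed(0)
d, k = 8, 3  # embedding size and number of tensor slices (assumed, tiny for illustration)

def randvec(n):
    return [random.gauss(0, 0.1) for _ in range(n)]

# Parameters for one relation R (random here; in the paper they are learned).
W = [[randvec(d) for _ in range(d)] for _ in range(k)]  # k slices of d x d bilinear forms
V = [randvec(2 * d) for _ in range(k)]                  # standard-layer weights
b = [0.0] * k                                           # bias
u = randvec(k)                                          # output weights

def ntn_score(e1, e2):
    """g(e1, R, e2) = u^T tanh(e1^T W^[1:k] e2 + V [e1; e2] + b)."""
    h = []
    for s in range(k):
        bilinear = sum(e1[i] * W[s][i][j] * e2[j] for i in range(d) for j in range(d))
        standard = sum(v * x for v, x in zip(V[s], e1 + e2))
        h.append(math.tanh(bilinear + standard + b[s]))
    return sum(ui * hi for ui, hi in zip(u, h))

# Entities are the average of their word vectors (hypothetical toy vocabulary),
# so "sumatran tiger" and "bengal tiger" share the vector for "tiger".
words = {w: randvec(d) for w in ['sumatran', 'bengal', 'tiger', 'cat', 'city', 'town']}
def entity_vec(name):
    vecs = [words[w] for w in name.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Rank candidate right-hand-side entities for a fixed (e1, R) by descending score.
e1 = entity_vec('sumatran tiger')
candidates = ['bengal tiger', 'cat', 'city', 'town']
ranked = sorted(candidates, key=lambda c: ntn_score(e1, entity_vec(c)), reverse=True)
```

With trained parameters, the head of this ranked list would correspond to a row of Table 3; here the ordering is arbitrary because the parameters are random.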