{"title": "A latent factor model for highly multi-relational data", "book": "Advances in Neural Information Processing Systems", "page_first": 3167, "page_last": 3175, "abstract": "Many data such as social networks, movie preferences or knowledge bases are multi-relational, in that they describe multiple relationships between entities. While there is a large body of work focused on modeling these data, few considered modeling these multiple types of relationships jointly. Further, existing approaches tend to breakdown when the number of these types grows. In this paper, we propose a method for modeling large multi-relational datasets, with possibly thousands of relations. Our model is based on a bilinear structure, which captures the various orders of interaction of the data, but also shares sparse latent factors across different relations. We illustrate the performance of our approach on standard tensor-factorization datasets where we attain, or outperform, state-of-the-art results. Finally, a NLP application demonstrates our scalability and the ability of our model to learn efficient, and semantically meaningful verb representations.", "full_text": "A latent factor model for highly multi-relational data\n\nRodolphe Jenatton\n\nCMAP, UMR CNRS 7641,\n\nEcole Polytechnique, Palaiseau, France\n\njenatton@cmap.polytechnique.fr\n\nNicolas Le Roux\n\nINRIA - SIERRA Project Team,\n\nEcole Normale Sup\u00b4erieure, Paris, France\n\nnicolas@le-roux.name\n\nAntoine Bordes\n\nHeudiasyc, UMR CNRS 7253,\n\nUniversit\u00b4e de Technologie de Compi`egne, France\n\nantoine.bordes@utc.fr\n\nGuillaume Obozinski\n\nINRIA - SIERRA Project Team,\n\nEcole Normale Sup\u00b4erieure, Paris, France\nguillaume.obozinski@ens.fr\n\nAbstract\n\nMany data such as social networks, movie preferences or knowledge bases are\nmulti-relational, in that they describe multiple relations between entities. 
While there is a large body of work focused on modeling these data, modeling these multiple types of relations jointly remains challenging. Further, existing approaches tend to break down when the number of these types grows. In this paper, we propose a method for modeling large multi-relational datasets, with possibly thousands of relations. Our model is based on a bilinear structure, which captures various orders of interaction of the data, and also shares sparse latent factors across different relations. We illustrate the performance of our approach on standard tensor-factorization datasets where we attain, or outperform, state-of-the-art results. Finally, an NLP application demonstrates our scalability and the ability of our model to learn efficient and semantically meaningful verb representations.\n\n1 Introduction\n\nStatistical Relational Learning (SRL) [7] aims at modeling data consisting of relations between entities. Social networks, preference data from recommender systems, and relational databases used for the semantic web or in bioinformatics illustrate the diversity of applications in which such modeling has a potential impact.\n\nRelational data typically involve different types of relations between entities or attributes. These entities can be users in the case of social networks or recommender systems, words in the case of lexical knowledge bases, or genes and proteins in the case of bioinformatics ontologies, to name a few. For binary relations, the data is naturally represented as a so-called multi-relational graph, consisting of nodes associated with entities and of different types of edges between nodes corresponding to the different types of relations. Equivalently, the data consists of a collection of triplets of the form (subject, relation, object) listing the actual relationships, where we call the subject and the object respectively the first and second term of a binary relation. 
Relational data typically accumulate many difficulties. First, they involve a large number of relation types, some being significantly more represented than others and possibly concerning only subsets of entities; second, the data is typically noisy and incomplete (missing or incorrect relationships, redundant entities); finally, most datasets are large scale, with up to millions of entities and billions of links for real-world knowledge bases.\n\nBesides relational databases, SRL can also be used to model natural language semantics. A standard way of representing the meaning of language is to identify entities and relations in texts or speech utterances and to organize them. This can be conducted at various scales, from the word or sentence level (e.g., in parsing or semantic role labeling) to a collection of texts (e.g., in knowledge extraction). SRL systems are a useful tool there, as they can automatically extract high-level information from the collected data by building summaries [22], sense categorization lexicons [11], ontologies [20], etc. Progress in SRL would be likely to lead to advances in natural language understanding.\n\nIn this paper, we introduce a model for relational data and apply it to multi-relational graphs and to natural language. By assigning high probabilities to valid relations and low probabilities to all the others, this model extracts meaningful representations of the various entities and relations in the data. Unlike other factorization methods (e.g., [15]), our model is probabilistic, which has the advantage of accounting explicitly for the uncertainties in the data. Besides, thanks to a sparse distributed representation of relation types, our model can handle data with a significantly larger number of relation types than was considered so far in the literature (a crucial aspect for natural language data). 
We empirically show that this approach ties or beats state-of-the-art algorithms on various benchmarks of link prediction, a standard test-bed for SRL methods.\n\n2 Related work\n\nA branch of relational learning, motivated by applications such as collaborative filtering and link prediction in networks, models relations between entities as resulting from intrinsic latent attributes of these entities.1 Work in what we will call relational learning from latent attributes (RLA) focused mostly on the problem of modeling a single relation type, as opposed to trying to model simultaneously a collection of relations which can themselves be similar. As reflected by several formalisms proposed for relational learning [7], it is the latter multi-relational learning problem which is needed to model efficiently large-scale relational databases. The fact that relations can be similar or related suggests that a superposition of independently learned models for each relation would be highly inefficient, especially since the relationships observed for each relation are extremely sparse.\n\nRLA often translates into learning an embedding of the entities, which corresponds algebraically to a matrix factorization problem (typically, a factorization of the matrix of observed relationships). A natural extension to learning multiple relations consists in stacking the matrices to be factorized and applying classical tensor factorization methods such as CANDECOMP/PARAFAC [25, 8]. 
This approach, which inherently induces some sharing of parameters between both different terms and different relations, has been applied successfully [8] and has inspired some probabilistic formulations [4]. Another natural extension to learning several relations simultaneously is to share a common embedding of the entities across relations via collective matrix factorization, as proposed in RESCAL [15] and other related work [18, 23].\n\nThe simplest form of latent attribute that can be associated to an entity is a latent class: the resulting model is the classical stochastic blockmodel [26, 17]. Several clustering-based approaches have been proposed for multi-relational learning: [9] considered a non-parametric Bayesian extension of the stochastic blockmodel allowing the number of latent clusters to be inferred automatically; [14, 28] refined this to allow entities to have a mixed cluster membership; [10] introduced clustering in Markov-Logic networks; [24] used a non-parametric Bayesian clustering of entity embeddings in a collective matrix factorization formulation. To share parameters between relations, [9, 24, 14, 28] and [10] build models that cluster not only entities but relations as well.\n\nWith the same aim of reducing the number of parameters, the Semantic Matching Energy model (SME) of [2] embeds relations as vectors from the same space as the entities and models likely relationships by an energy combining binary interactions between the relation vector and each of the vectors encoding the two terms.\n\nIn terms of scalability, RESCAL [15], which has been shown to achieve state-of-the-art performance on several relation datasets, has recently been applied to the knowledge base YAGO [16], thereby showing its ability to scale well on data with very large numbers of entities, although the number of relations modeled remained moderate (less than 100). 
As for SME [2], its modeling of relations by vectors allowed it to scale to several thousands of relations. Scalability can also be an issue for nonparametric Bayesian models (e.g., [9, 24]) because of the cost of inference.\n\n1This is called Statistical Predicate Invention by [10].\n\n3 Relational data modeling\n\nWe consider relational data consisting of triplets that encode the existence of a relation between two entities, which we will call the subject and the object. Specifically, we consider a set of ns subjects {S_i, 1 <= i <= ns} along with no objects {O_k, 1 <= k <= no}, which are related by some of nr relations {R_j, 1 <= j <= nr}. A triplet encodes that the relation R_j holds between the subject S_i and the object O_k, which we will write R_j(S_i, O_k) = 1. We will therefore also refer to a triplet as a relationship. A typical example, which we will discuss in greater detail, comes from natural language processing, where a triplet (S_i, R_j, O_k) corresponds to the association of a subject and a direct object through a transitive verb. The goal is to learn a model of the relations so as to reliably predict unseen triplets. For instance, one might be interested in finding a likely relation R_j based only on the subject and object (S_i, O_k).\n\n4 Model description\n\nIn this work, we formulate the problem of learning a relation as a matrix factorization problem. Following a rationale underlying several previous approaches [15, 24], we consider a model in which entities are embedded in R^p and relations are encoded as bilinear operators on the entities. More precisely, we assume that the ns subjects (resp. no objects) are represented by vectors of R^p, stored as the columns of the matrix S := [s_1, ..., s_ns] ∈ R^{p×ns} (resp. as the columns of O := [o_1, ..., o_no] ∈ R^{p×no}). Each of the p-dimensional representations s_i, o_k will have to be learned. 
The relations are represented by a collection of matrices (R_j), 1 <= j <= nr, with R_j ∈ R^{p×p}, which together form a three-dimensional tensor.\n\nWe consider a model of the probability of the event {R_j(S_i, O_k) = 1}. Assuming first that s_i and o_k are fixed, our model is derived from a logistic model P[R_j(S_i, O_k) = 1] := σ(η_ik^(j)), with σ(t) := 1/(1 + e^{−t}). A natural form for η_ik^(j) is a linear function of the tensor product s_i ⊗ o_k, which we can write η_ik^(j) = ⟨s_i, R_j o_k⟩, where ⟨·, ·⟩ is the usual inner product in R^p. If we now think of learning s_i, R_j and o_k for all (i, j, k) simultaneously, this model learns together the matrices R_j and optimal embeddings s_i, o_k of the entities, so that the usual logistic regressions based on s_i ⊗ o_k predict well the probability of the observed relationships. This is the initial model considered in [24], and it matches the model considered in [16] if the least-square loss is substituted for the logistic loss. 
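As an illustration, this bilinear logistic score can be sketched in NumPy as follows; the toy dimension and random embeddings are our own assumptions, not values from the paper:

```python
import numpy as np

def bilinear_logistic(s, R, o):
    """P[R_j(S_i, O_k) = 1] = sigma(<s_i, R_j o_k>) for a single triplet."""
    eta = s @ R @ o                # bilinear score <s, R o>
    return 1.0 / (1.0 + np.exp(-eta))

rng = np.random.default_rng(0)
p = 4                              # toy embedding dimension
s = rng.normal(size=p)             # subject embedding s_i
o = rng.normal(size=p)             # object embedding o_k
R = rng.normal(size=(p, p))        # relation operator R_j

prob = bilinear_logistic(s, R, o)  # probability that the triplet holds
```

In the full model, s, o and R would be learned jointly over all observed triplets.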
We will refine this model in two ways: first, by redefining the term η_ik^(j) as a function η_ik^(j) := E(s_i, R_j, o_k) taking into account the different orders of interaction between s_i, o_k and R_j; second, by parameterizing the relations R_j by latent "relational" factors that reduce the overall number of parameters of the model.\n\n4.1 A multiple order log-odds ratio model\n\nOne way of thinking about the probability of occurrence of a specific relationship corresponding to the triplet (S_i, R_j, O_k) is as resulting (a) from the marginal propensity of individual entities S_i, O_k to enter relations and the marginal propensity of relations R_j to occur; (b) from 2-way interactions of (S_i, R_j) and (R_j, O_k), corresponding to entities tending to occur marginally as left or right terms of a relation; (c) from 2-way interactions of pairs of entities (S_i, O_k) that overall tend to have more relations together; and (d) from the 3-way dependencies between (S_i, R_j, O_k). In NLP, we often refer to these as respectively unigram, bigram and trigram terms, a terminology which we will reuse in the rest of the paper. We therefore design E(s_i, R_j, o_k) to account for these interactions of various orders, retaining only terms involving R_j.2 In particular, introducing new parameters y, y', z, z' ∈ R^p, we define η_ik^(j) = E(s_i, R_j, o_k) as\n\nE(s_i, R_j, o_k) := ⟨y, R_j y'⟩ + ⟨s_i, R_j z⟩ + ⟨z', R_j o_k⟩ + ⟨s_i, R_j o_k⟩,   (1)\n\nwhere ⟨y, R_j y'⟩, ⟨s_i, R_j z⟩ + ⟨z', R_j o_k⟩ and ⟨s_i, R_j o_k⟩ are respectively the uni-, bi- and trigram terms. 
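A minimal sketch of the energy in (1), with toy random parameters of our own choosing (y2 and z2 stand for y' and z'); it also checks numerically that the four terms collapse to a single bilinear form plus a relation-dependent offset:

```python
import numpy as np

def energy(s, R, o, y, y2, z, z2):
    """E(s, R, o) = <y, R y'> + <s, R z> + <z', R o> + <s, R o>  (eq. 1)."""
    uni = y @ R @ y2                # unigram term: depends on the relation only
    bi = s @ R @ z + z2 @ R @ o     # bigram terms: (subject, relation) and (relation, object)
    tri = s @ R @ o                 # trigram term: full 3-way interaction
    return uni + bi + tri

rng = np.random.default_rng(0)
p = 4
s, o, y, y2, z, z2 = (rng.normal(size=p) for _ in range(6))
R = rng.normal(size=(p, p))

E = energy(s, R, o, y, y2, z, z2)
# Redundancy of the parametrization: E equals <(s + z'), R (o + z)> plus a
# constant that does not depend on the pair (s, o).
E_collapsed = (s + z2) @ R @ (o + z) + y @ R @ y2 - z2 @ R @ z
```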
This parametrization is redundant in general, given that E(s_i, R_j, o_k) is of the form ⟨(s_i + z'), R_j (o_k + z)⟩ + b_j; it is however useful in the context of a regularized model (see Section 5).\n\n2This is motivated by the fact that we are primarily interested in modeling the relation terms, and that it is not necessary to introduce all terms to fully parameterize the model.\n\n4.2 Sharing parameters across relations through latent factors\n\nWhen learning a large number of relations, the number of observations for many relations can be quite small, leading to a risk of overfitting. Sutskever et al. [24] addressed this issue with a non-parametric Bayesian model inducing clustering of both relations and entities. SME [2] proposed to embed relations as vectors of R^p, like entities, to tackle problems with hundreds of relation types. With a similar motivation to decrease the overall number of parameters, instead of using a general parameterization of the matrices R_j as in RESCAL [16], we require that all R_j decompose over a common set of d rank-one matrices {Θ_r}, 1 <= r <= d, representing some canonical relations:\n\nR_j = Σ_{r=1}^d α_r^j Θ_r, for some sparse α^j ∈ R^d, and Θ_r = u_r v_r^T for u_r, v_r ∈ R^p.   (2)\n\nThe combined effect of (a) the sparsity of the decomposition and (b) the fact that d ≪ nr leads to sharing parameters across relations. Further, constraining Θ_r to be the outer product u_r v_r^T also speeds up all computations relying on linear algebra.\n\n5 Regularized formulation and optimization\n\nDenoting P (resp. N) the set of indices of positively (resp. negatively) labeled relations, the likelihood we seek to maximize is\n\nL := Π_{(i,j,k)∈P} P[R_j(S_i, O_k) = 1] · Π_{(i',j',k')∈N} P[R_{j'}(S_{i'}, O_{k'}) = 0].\n\nThe log-likelihood is thus log(L) = Σ_{(i,j,k)∈P} η_ik^(j) − Σ_{(i,j,k)∈P∪N} log(1 + exp(η_ik^(j))), with η_ik^(j) = E(s_i, R_j, o_k). To properly normalize the terms appearing in (1) and (2), we carry out the minimization of the negative log-likelihood over a specific constraint set, namely\n\nmin_{S, O, {α^j}, {Θ_r}, y, y', z, z'} − log(L), subject to ‖α^j‖_1 ≤ λ, Θ_r = u_r v_r^T, z = z', O = S, and s_i, o_k, y, y', z, u_r and v_r in the ball {w : ‖w‖_2 ≤ 1}.\n\nWe chose to constrain α in ℓ1-norm based on preliminary experiments suggesting that it led to better results than the regularization in ℓ2-norm. The regularization parameter λ ≥ 0 controls the sparsity of the relation representations in (2). The equality constraints induce a shared representation between subjects and objects, which was shown to improve the model in preliminary experiments. Given the fact that the model is conditional on a pair (s_i, o_k), only a single scale parameter, namely α_r^j, is necessary in the product α_r^j ⟨s_i, Θ_r o_k⟩, which motivates all the Euclidean unit ball constraints.\n\n5.1 Algorithmic approach\n\nGiven the large scale of the problems we are interested in (e.g., |P| ≈ 10^6), and since we can project efficiently onto the constraint set (both the projections onto the ℓ1- and ℓ2-norm balls can be performed in linear time [1]), our optimization problem lends itself well to a stochastic projected gradient algorithm [3]. In order to speed up the optimization, we use several practical tricks. First, we consider a stochastic gradient descent scheme with mini-batches containing 100 triplets. 
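The two projections used after each gradient step can be sketched as follows; for the ℓ1 ball we show a simple sort-based variant (O(n log n)) rather than the linear-time algorithm of [1], which computes the same projection:

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def project_l1_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w||_1 <= radius} (sort-based variant)."""
    if np.abs(w).sum() <= radius:
        return w.copy()
    u = np.sort(np.abs(w))[::-1]                   # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, w.size + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]  # last index kept active
    theta = (css[rho] - radius) / (rho + 1.0)      # soft-threshold level
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)
```

In the formulation above, the α^j would be projected onto the ℓ1 ball of radius λ and the remaining parameter vectors onto the Euclidean unit ball after each mini-batch update.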
Second, we use stepsizes of the form a/(1 + k), with k the iteration number and a a scalar (common to all parameters) optimized over a logarithmic grid on a validation set.3\n\nAdditionally, we cannot treat the NLP application (see Sec. 8) as a standard tensor factorization problem. Indeed, in that case, we only have access to the positively labeled triplets P. Following [2], we generate elements in N by considering triplets of the form {(i, j', k)}, j' ≠ j, for each (i, j, k) ∈ P. In practice, for each positive triplet, we sample a number of artificial negative triplets containing the same subject and object as our positive triplet but different verbs. This allowed us to change the problem into a multiclass one where the goal was to correctly classify the "positive" verb, in competition with the "negative" ones.\n\nThe standard approach for this problem is to use a multinomial logistic function. However, such a function is highly sensitive to the particular choice of negative verbs, and using all the verbs as negative ones would be too costly. Another, more robust, approach consists in using the likelihood function defined above, where we try to classify the positive verbs as valid relationships and the negative ones as invalid relationships. Further, this approximation to the multinomial logistic function is asymptotically unbiased. Finally, we observed that it was advantageous to down-weight the influence of the negative verbs to avoid swamping the influence of the positive ones.\n\n3The code is available under an open-source license from http://goo.gl/TGYuh.\n\n6 Relation to other models\n\nOur model is closely related to several other models. 
First, if d is large, the parameters of the R_j are decoupled and the RESCAL model is retrieved (up to a change of loss function).\n\nSecond, our model is also related to classical tensor factorization models such as PARAFAC, which approximate the tensor [R_k(S_i, O_j)]_{i,j,k} in the least-square sense by a low-rank tensor H̃ of the form Σ_{r=1}^d α_r ⊗ β_r ⊗ γ_r, with α_r ∈ R^{nr}, β_r ∈ R^{ns} and γ_r ∈ R^{no}. The parameterization of all R_j as linear combinations of d rank-one matrices is in fact equivalent to constraining the tensor R = {R_j}, 1 <= j <= nr, to be the low-rank tensor R = Σ_{r=1}^d α_r ⊗ u_r ⊗ v_r. As a consequence, the tensor of all trigram terms4 can also be written as Σ_{r=1}^d α_r ⊗ β_r ⊗ γ_r, with β_r = S^T u_r and γ_r = O^T v_r. This shows that our model is a particular form of tensor factorization which reduces to PARAFAC (up to a change of loss function) when p is sufficiently large.\n\nFinally, the approach considered in [2] seems a priori quite different from ours, in particular since relations are in that work embedded as vectors of R^p, like the entities, as opposed to matrices of R^{p×p} in our case. This choice can be detrimental to the modeling of complex relation patterns, as we show in Section 7. In addition, no parameterization of the model of [2] is able to handle both bigram and trigram interactions as we propose.\n\n7 Application to multi-relational benchmarks\n\nWe report in this section the performance of our model evaluated on standard tensor-factorization datasets, which we first briefly describe.\n\n7.1 Datasets\n\nKinships. Australian tribes are renowned among anthropologists for the complex relational structure of their kinship systems. This dataset, created by [6], focuses on the Alyawarra, a tribe from Central Australia. 
104 tribe members were asked to provide the kinship terms they used for one another. This results in a graph of 104 entities and 26 relation types, each of them depicting a different kinship term, such as Adiadya or Umbaidya. See [6] or [9] for more details.\n\nUMLS. This dataset contains data from the Unified Medical Language System semantic work gathered by [12]. It consists of a graph with 135 entities and 49 relation types. The entities are high-level concepts like 'Disease or Syndrome', 'Diagnostic Procedure', or 'Mammal'. The relations represent verbs depicting causal influence between concepts, like 'affect' or 'cause'.\n\nNations. This dataset groups 14 countries (Brazil, China, Egypt, etc.) with 56 binary relation types representing interactions among them, like 'economic aid', 'treaties' or 'rel diplomacy', and 111 features describing each country, which we treated as 111 additional entities interacting with the country through an additional 'has feature' relation5. See [21] for details.\n\n4Other terms can be decomposed in a similar way.\n\n5The resulting new relationships were only used for training, and not considered at test time.\n\nDatasets | Metric | Our approach | RESCAL [16] | MRC [10] | SME [2]\nKinships | Area under PR curve | 0.946 ± 0.005 | 0.95 | 0.84 | 0.907 ± 0.008\nKinships | Log-likelihood | -0.029 ± 0.001 | N/A | -0.045 ± 0.002 | N/A\nUMLS | Area under PR curve | 0.990 ± 0.003 | 0.98 | 0.98 | 0.983 ± 0.003\nUMLS | Log-likelihood | -0.002 ± 0.0003 | N/A | -0.004 ± 0.001 | N/A\nNations | Area under PR curve | 0.909 ± 0.009 | 0.84 | 0.75 | 0.883 ± 0.02\nNations | Log-likelihood | -0.202 ± 0.008 | N/A | -0.311 ± 0.022 | N/A\n\nTable 1: Comparisons of the performance obtained by our approach, RESCAL [16], MRC [10] and SME [2] over three standard datasets. 
The results are computed by 10-fold cross-validation.\n\n7.2 Results\n\nThese three datasets are relatively small-scale and contain only a few relation types (in the order of tens). Since our model is primarily designed to handle a large number of relation types (see Sec. 4.2), this setting is not the most favorable for evaluating the potential of our approach. As reported in Table 1, our method does nonetheless yield performance that is better than or equal to that of previous state-of-the-art techniques, both in terms of area under the precision-recall curve (AUC) and log-likelihood (LL). The results displayed in Table 1 are computed by 10-fold cross-validation6, averaged over 10 random splits of the datasets (90% for cross-validation and 10% for testing). We chose to compare our model with RESCAL [16], MRC [10] and SME [2] because, to the best of our knowledge, they achieved the best published results on these benchmarks in terms of AUC and LL.\n\nInterestingly, the trigram term from (1) is essential to obtain good performance on Kinships (with the trigram term removed, we obtain 0.16 in AUC and −0.14 in LL), thus showing the need for modeling 3-way interactions in complex relational data. Moreover, and as expected due to the low number of relations, the value of λ selected by cross-validation is quite large (λ = nr × d) and, as a consequence, does not lead to sparsity in (2). Results on this dataset also exhibit the benefit of modeling relations with matrices instead of vectors, as SME [2] does.\n\nZhu [28] recently reported results on Nations and Kinships evaluated in terms of area under the receiver-operating-characteristic curve, instead of the area under the precision-recall curve we display in Table 1. 
With this other metric, our model obtains 0.953 on Nations and 0.992 on Kinships, and hence outperforms Zhu's approach, which achieves 0.926 and 0.962 respectively.\n\n8 Learning semantic representations of verbs\n\nBy providing an approach to model the relational structure of language, SRL can be of great use for learning natural language semantics. Hence, this section proposes an application of our method to text data from Wikipedia for learning a representation of words, with a focus on verbs.\n\n8.1 Experimental setting\n\nData. We collected this data in two stages. First, the SENNA software7 [5] was used to perform part-of-speech tagging, chunking, lemmatization8 and semantic role labeling on ≈2,000,000 Wikipedia articles. This data was then filtered to only select sentences for which the syntactic structure was (subject, verb, direct object), with each term of the triplet being a single word from the WordNet lexicon [13]. Subjects and direct objects ended up being all single nouns, whose dictionary size is 30,605. The total number of relations in this dataset (i.e., the number of verbs) is 4,547: this is much larger than for previously published multi-relational benchmarks. We kept 1,000,000 such relationships to build a training set, 50,000 for a validation set and 250,000 for test. 
All triplets are unique, and we made sure that all words appearing in the validation or test sets also occurred in the training set.9\n\n6The values of λ, d and p are searched in nr × d · {0.05, 0.1, 0.5, 1}, {100, 200, 500} and {10, 25, 50}.\n\n7Available from ronan.collobert.com/senna/.\n\n8Lemmatization was carried out using NLTK (nltk.org) and transforms a word into its base form.\n\n9The data set is available under an open-source license from http://goo.gl/TGYuh.\n\nMethod | p@5 (synonyms not considered) | p@20 | median/mean rank | p@5 (best synonyms considered) | p@20 | median/mean rank\nOur approach | 0.78 | 0.95 | 19 / 96.7 | 0.89 | 0.98 | 50 / 195.0\nSME [2] | 0.77 | 0.95 | 19 / 99.2 | 0.89 | 0.98 | 56 / 199.6\nBigram | 0.72 | 0.83 | 17 / 157.7 | 0.87 | 0.95 | 48 / 517.4\n\nTable 2: Performance obtained on the NLP dataset by our approach, SME [2] and a bigram model. Details about the statistics of the table are given in the text.\n\nPractical training setup. During the training phase, we optimized various parameters over the validation set, namely: the size p ∈ {25, 50, 100} of the representations, the dimension d ∈ {50, 100, 200} of the latent decompositions (2), the value of the regularization parameter λ as a fraction {1, 0.5, 0.1, 0.05, 0.01} of nr × d, the stepsize in {0.1, 0.05, 0.01}, and the weighting of the negative triplets. Moreover, to speed up the training, we gradually increased the number of sampled negative verbs (cf. Section 5.1) from 25 up to 50, which had the effect of refining the training.\n\n8.2 Results\n\nVerb prediction. We first consider a direct evaluation of our approach, based on the test set of 250,000 instances, by measuring how well we predict a relevant and meaningful verb given a pair (subject, direct object). To this end, for each test relationship, we rank all verbs using our probability estimates given the pair (subject, direct object). 
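The resulting ranking metrics can be sketched as follows (our own minimal implementation; p@z counts the correct verb among the top z% of the ranked verb list):

```python
import numpy as np

def rank_of_correct(scores, correct):
    """1-based rank of the correct verb when verbs are sorted by decreasing score."""
    order = np.argsort(-scores)
    return int(np.where(order == correct)[0][0]) + 1

def precision_at(ranks, z, n_verbs):
    """p@z: fraction of test examples whose correct verb falls in the top z% of verbs."""
    cutoff = z / 100.0 * n_verbs
    return float(np.mean([r <= cutoff for r in ranks]))
```

The median and mean ranks reported below would then simply be the median and mean of the per-example ranks.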
Table 2 displays our results with two kinds of metrics, namely (1) the rank of the correct verb and (2) the fraction of test examples for which the correct verb is ranked in the top z% of the list. The latter criterion is referred to as p@z. In order to evaluate whether some language semantics is captured by the representations, we also consider a less conservative approach where, instead of focusing on the correct verb only, we measure the minimum rank achieved over its set of synonyms obtained from WordNet. Our method is compared with that of SME [2], which was shown to scale well on data with large sets of relations, and with a bigram model, which estimates the probabilities of the pairs (subject, verb) and (verb, direct object).\n\nThe first observation is that the task of verb prediction can be quite well addressed by a simple model based on 2-way interactions, as shown by the good median rank obtained by the bigram model. This is confirmed by the mild influence of the trigram term on the performance of our model. On this data, we found that using bigram interactions in our energy function was essential to achieve good predictions. However, the drop in the mean rank between our approach and the bigram-only model still indicates that many examples do need a richer model to be correctly handled. By comparison, we tend to consistently match or improve upon the performance of SME. Remarkably, model selection led to the choice of λ = 0.1 · nr × d, for which the coefficients α of the representations (2) are sparse, in the sense that they are dominated by a few large values (e.g., the top 2% of the largest values of α account for about 25% of the total ℓ1-norm ‖α‖_1).\n\nFigure 1: Precision-recall curves for the task of lexical similarity classification. The curves are computed based on different similarity measures between verbs, namely, our approach, SME [2], Collobert et al. 
[5] and the best (out of three) WordNet similarity measure [13]. Details about the task can be found in the text.\n\n[Figure 1 shows two precision-recall panels, "Predicting class 4" and "Predicting classes 3 and 4", each comparing our approach, SME, Collobert et al. and the best WordNet measure.]\n\nMethod | AUC (class 4) | AUC (classes 3&4)\nOur approach | 0.40 | 0.54\nSME [2] | 0.21 | 0.36\nCollobert et al. [5] | 0.31 | 0.48\nBest WordNet [19] | 0.40 | 0.59\n\nTable 3: Performance obtained on a task of lexical similarity classification [27], where we compare our approach, SME [2], Collobert et al.'s word embeddings [5] and the best (out of 3) WordNet Similarity measure [19], using area under the precision-recall curve. Details are given in the text.\n\nLexical similarity classification. Our method learns latent representations for verbs and imposes some structure on them via shared parameters, as shown in Section 4.2. This should lead to similar representations for similar verbs. To evaluate this hypothesis, we consider the task of lexical similarity classification described in [27]. Their dataset consists of 130 pairs of verbs labeled by humans with a score in {0, 1, 2, 3, 4}. Higher scores mean a stronger semantic similarity between the verbs composing the pair. For instance, (divide, split) is labeled 4, while (postpone, show) has a score of 0.\n\nBased on the pairwise Euclidean distances10 between our learned verb representations R_j, we try to predict the class 4 (and also the "merged" classes {3, 4}), using the assumption that the smaller the distance between R_i and R_j, the more likely the pair (i, j) should be labeled 4. We compare to representations learnt by [2] on the same training data, to the word embeddings of [5] (which are considered as efficient features in natural language processing), and to three similarity measures provided by WordNet Similarity [19]. 
For the latter, we only display the best one, named "path", which is built by counting the number of nodes along the shortest path between the senses in the "is-a" hierarchies of WordNet.

We report our results as precision-recall curves displayed in Figure 1, with the corresponding areas under the curve (AUC) in Table 3. Even though we tend to miss the first few pairs, we compare favorably to [2] and [5], and our AUC is close to the reference established by WordNet Similarity. Our method is capable of encoding meaningful semantic embeddings for verbs, even though it has been trained on noisy, automatically collected data, and in spite of the fact that it was not our primary goal that distance in parameter space should satisfy any particular condition. Performance might be improved by training on cleaner triplets, such as those collected by [11].

9 Conclusion

Designing methods capable of handling large numbers of linked relations seems necessary to model the wealth of relations underlying the semantics of real-world problems. We tackle this problem by using a shared representation of relations naturally suited to multi-relational data, in which entities have a unique representation shared between relation types, and where we propose that relations themselves decompose over latent "relational" factors. This new approach ties or beats state-of-the-art models on both standard relational learning problems and an NLP task. The decomposition of relations over latent factors allows a significant reduction of the number of parameters and is motivated by both computational and statistical reasons. In particular, our approach is quite scalable, both with respect to the number of relations and to the number of data samples.

One might wonder about the relative importance of the various terms in our formulation.
Interestingly, though the presence of the trigram term was crucial in the tensor factorization problems, it played a marginal role in the NLP experiment, where most of the information was contained in the bigram and unigram terms.

Finally, we believe that exploring the similarities of the relations through an analysis of the latent factors could provide some insight into the structures shared between different relation types.

Acknowledgments

This work was partially funded by the Pascal2 European Network of Excellence. NLR and RJ are supported by the European Research Council (resp., SIERRA-ERC-239993 and SIPA-ERC-256919).

¹⁰Other distances could of course be considered; we chose the Euclidean metric for simplicity.

References

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.
[2] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 2012. To appear.
[3] L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems, volume 16, pages 217–224, 2004.
[4] W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. Journal of Machine Learning Research - Proceedings Track, 5:89–96, 2009.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.
[6] W. Denham. The detection of patterns in Alyawarra nonverbal behavior. PhD thesis, 1973.
[7] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[8] R. A. Harshman and M. E. Lundy. PARAFAC: parallel factor analysis. Comput. Stat. Data Anal., 18(1):39–72, Aug. 1994.
[9] C.
Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proc. of AAAI, pages 381–388, 2006.
[10] S. Kok and P. Domingos. Statistical predicate invention. In Proceedings of the 24th International Conference on Machine Learning, pages 433–440, 2007.
[11] A. Korhonen, Y. Krymolowski, and T. Briscoe. A large subcategorization lexicon for natural language processing applications. In Proceedings of LREC, 2006.
[12] A. T. McCray. An upper level ontology for the biomedical domain. Comparative and Functional Genomics, 4:80–88, 2003.
[13] G. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[14] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems 22, pages 1276–1284, 2009.
[15] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th Intl Conf. on Mach. Learn., pages 809–816, 2011.
[16] M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: scalable machine learning for linked data. In Proc. of the 21st Intl Conf. on WWW, pages 271–280, 2012.
[17] K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
[18] A. Paccanaro and G. Hinton. Learning distributed representations of concepts using linear relational embedding. IEEE Trans. on Knowl. and Data Eng., 13:232–244, 2001.
[19] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity: measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38–41, 2004.
[20] H. Poon and P. Domingos. Unsupervised ontology induction from text.
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 296–305, 2010.
[21] R. J. Rummel. Dimensionality of nations project: attributes of nations and behavior of nation dyads. In ICPSR data file, pages 1950–1965, 1999.
[22] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proc. of the 20th Intl Joint Conf. on Artif. Intel., pages 2862–2867, 2007.
[23] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proc. of SIGKDD'08, pages 650–658, 2008.
[24] I. Sutskever, R. Salakhutdinov, and J. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In Adv. in Neur. Inf. Proc. Syst. 22, 2009.
[25] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966.
[26] Y. J. Wang and G. Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397), 1987.
[27] D. Yang and D. M. W. Powers. Verb similarity on the taxonomy of WordNet. In Proceedings of GWC-06, pages 121–128, 2006.
[28] J. Zhu. Max-margin nonparametric latent feature models for link prediction. In Proceedings of the 29th Intl Conference on Machine Learning, 2012.