{"title": "DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 15347, "page_last": 15357, "abstract": "In this paper, we study the problem of learning probabilistic logical rules for inductive and interpretable link prediction. Despite the importance of inductive link prediction, most previous works focused on transductive link prediction and cannot manage previously unseen entities. Moreover, they are black-box models that are not easily explainable for humans. We propose DRUM, a scalable and differentiable approach for mining first-order logical rules from knowledge graphs that resolves these problems. We motivate our method by making a connection between learning confidence scores for each rule and low-rank tensor approximation. DRUM uses bidirectional RNNs to share useful information across the tasks of learning rules for different relations. We also empirically demonstrate the efficiency of DRUM over existing rule mining methods for inductive link prediction on a variety of benchmark datasets.", "full_text": "DRUM: End-To-End Differentiable Rule Mining On\n\nKnowledge Graphs\n\nAli Sadeghian *1, Mohammadreza Armandpour*2, Patrick Ding,2 Daisy Zhe Wang,1\n\n{asadeghian, daisyw}@ufl.edu, {armand, patrickding}@stat.tamu.edu\n\n1 Department of Computer Science, University of Florida\n\n2 Department of Statistics, Texas A&M University\n\nAbstract\n\nIn this paper, we study the problem of learning probabilistic logical rules for\ninductive and interpretable link prediction. Despite the importance of inductive link\nprediction, most previous works focused on transductive link prediction and cannot\nmanage previously unseen entities. Moreover, they are black-box models that are\nnot easily explainable for humans. We propose DRUM, a scalable and differentiable\napproach for mining \ufb01rst-order logical rules from knowledge graphs which resolves\nthese problems. 
We motivate our method by making a connection between learning confidence scores for each rule and low-rank tensor approximation. DRUM uses bidirectional RNNs to share useful information across the tasks of learning rules for different relations. We also empirically demonstrate the efficiency of DRUM over existing rule mining methods for inductive link prediction on a variety of benchmark datasets.

1 Introduction

Knowledge bases store structured information about real-world entities such as people, locations, companies, and governments. Knowledge base construction has attracted the attention of researchers, foundations, industry, and governments [11, 13, 34, 38]. Nevertheless, even the largest knowledge bases remain incomplete due to the limitations of human knowledge, web corpora, and extraction algorithms. Numerous projects have been developed to narrow the gap between KBs and human knowledge. A popular approach is to use the existing elements in the knowledge graph to infer the existence of new ones. There are two prominent directions in this line of research: representation learning, which obtains distributed vectors for all entities and relations in the knowledge graph [12, 31, 33], and rule mining, which uses observed co-occurrences of frequent patterns in the knowledge graph to determine logical rules [5, 15]. An example of knowledge graph completion with logical rules is shown in Figure 1.

One of the main advantages of logic-learning based methods for link prediction is that they can be applied to both transductive and inductive problems, while representation learning methods like those of Bordes et al. [4] and Yang et al. [40] cannot be employed in inductive scenarios. Consider the scenario in Figure 1, and suppose that at training time our knowledge base does not contain information about Obama's family. Representation learning techniques need to be retrained on the whole knowledge base in order to find the answer.
In contrast, rule mining methods can transfer reasoning to unseen facts.

Additionally, learning logical rules provides us with interpretable reasoning for predictions, which is not the case for embedding based methods. This interpretability can keep humans in the loop, facilitate debugging, and increase user trust. More importantly, rules allow domain knowledge transfer by enabling the addition of extra rules by experts, a strong advantage over representation learning models in scenarios with little or low-quality data.

*Authors contributed equally

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Rule mining has traditionally relied on predefined statistical measures such as support and confidence to assess the quality of rules. These are fixed heuristic measures that are not optimal for the various use cases in which one might want to apply the rules. For example, using standard confidence is not necessarily optimal for statistical relational learning. Therefore, finding a method that can simultaneously learn rule structures as well as appropriate scores is crucial. However, this is a challenging task, because the method needs to find an optimal structure in a large discrete space and simultaneously learn proper score values in a continuous space. Most previous approaches address parts of this problem [9, 20, 22, 39] but are not able to learn both structure and scores together, with the exception of Yang et al. [41].

In this paper we propose DRUM, a fully differentiable model to learn logical rules and their related confidence scores.
DRUM has significant importance because it not only addresses the aforementioned challenges, but also allows gradient based optimization to be employed for inductive logic programming tasks.

Our contributions can be summarized as: 1) an end-to-end differentiable rule mining model that is able to learn rule structures and scores simultaneously; 2) a connection between tensor completion and the estimation of confidence scores; 3) a theoretical demonstration that our formulation is expressive enough to find the rule structures and their related confidences; 4) finally, experiments showing that our method outperforms previous models on benchmark knowledge bases, both on the link prediction task and in terms of rule quality.

Figure 1: Using logical rules for knowledge base reasoning

2 Problem Statement

Definitions. We model a knowledge graph as a collection of facts G = {(s, r, o) | s, o ∈ E, r ∈ R}, where E and R represent the set of entities and relations in the knowledge graph, respectively. A first order logical rule is of the form B ⇒ H, where the body B = ⋀_i B_i(·, ·) is a conjunction of atoms B_i, e.g., livesIn(·, ·), and H is a specific predicate called the head. A rule is connected if every atom in the rule shares at least one variable with another atom, and a rule is closed if each variable in the rule appears in at least two atoms.

Rule Mining. We address the problem of learning first-order logical Horn clauses from a knowledge graph. In particular we are interested in mining closed and connected rules. These assumptions ensure finding meaningful rules that are human understandable.
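As a concrete illustration of this setup (and of the adjacency-matrix view developed in Section 4), here is a small sketch with made-up entities and facts mirroring Figure 1; multiplying the body's adjacency matrices counts the paths that support the head:

```python
import numpy as np

# Toy knowledge graph with 4 entities (illustrative names) and two
# relations; entity i corresponds to the one-hot vector v_i.
entities = ["Jeb", "GeorgeW", "Jenna", "Barbara"]
n = len(entities)

def adj(pairs):
    """Adjacency matrix A_B: A[i, j] = 1 iff B(entity_i, entity_j) holds."""
    A = np.zeros((n, n), dtype=int)
    for i, j in pairs:
        A[i, j] = 1
    return A

A_brotherOf = adj([(0, 1)])          # brotherOf(Jeb, GeorgeW)
A_fatherOf = adj([(1, 2), (1, 3)])   # fatherOf(GeorgeW, Jenna/Barbara)

# Body of the rule brotherOf(x, z) ∧ fatherOf(z, y) ⇒ uncleOf(x, y):
# (A_brotherOf @ A_fatherOf)[x, y] counts the connecting paths.
paths = A_brotherOf @ A_fatherOf
v_x = np.eye(n, dtype=int)[0]        # Jeb
v_y = np.eye(n, dtype=int)[2]        # Jenna
print(v_x @ paths @ v_y)             # 1 path -> uncleOf(Jeb, Jenna) is supported
```

A positive entry means at least one grounding of the body connects x to y, which is exactly the signal the differentiable formulation in Section 4 builds on.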
Connectedness also prevents finding rules with unrelated relations.

Formally, we aim to find all T ∈ ℕ and relations B1, B2, ···, BT, H, as well as a confidence value α ∈ ℝ, where:

    B1(x, z1) ∧ B2(z1, z2) ∧ ··· ∧ BT(z_{T−1}, y) ⇒ H(x, y) : α,    (1)

where the z_i are variables that can be substituted with entities. This requires searching a discrete space to find the B_i and searching a continuous space to learn α for every particular rule.

3 Related work

Mining Horn clauses has been previously studied in the Inductive Logic Programming (ILP) field, e.g., FOIL [29], MDIE [26] and Inspire [32]. Given a background knowledge base, ILP provides a framework for learning on multi-relational data. However, despite the strong representation power of ILP, it requires both positive and negative examples and does not scale to large datasets. This is a huge drawback since most knowledge bases are large and contain only positive facts.

[Figure 1: the rule brotherOf(X, Z) ∧ fatherOf(Z, Y) ⇒ uncleOf(X, Y) applied to the KB facts brotherOf(Jeb Bush, George W. Bush), fatherOf(George W. Bush, Jenna Bush Hager), brotherOf(George Obama, Barack Obama), and fatherOf(Barack Obama, Sasha Obama) yields the new facts uncleOf(Jeb Bush, Jenna Bush Hager) and uncleOf(George Obama, Sasha Obama).]

Recent rule mining methods such as AMIE+ [15] and Ontological Pathfinding (OP) [5] use predefined metrics such as confidence and support, and take advantage of various parallelization and partitioning techniques to speed up the counting process. However, they still suffer from the inherent limitations of relying on predefined confidence and discrete counting.

Most recent knowledge base rule mining approaches fall under the same category as ILP and OP. However, Yang et al. [40] show that one can also use graph embeddings to mine rules.
They introduce\nDistMult, a simple bilinear model for learning entity and relation representations. The relation\nrepresentations learned via the bilinear model can capture compositional relational semantics via\nmatrix multiplications. For example, if the rule B1(x, y) \u2227 B2(y, z) =\u21d2 H(x, z) holds, then\nintuitively so should AB1 AB2 \u2248 AH. To mine rules, they use the Frobenius norm to search for all\npossible pairs of relations with respect to their compositional relevance to each head. In a more recent\napproach Omran et al. [28] improve this method by leveraging pruning techniques and computing\ntraditional metrics to scale it up to larger knowledge bases.\nIn [16] the authors proposed a RESCAL-based model to learn from paths in KGs. More recently,\nYang et al. [41] provide the \ufb01rst fully differentiable rule mining method based on TensorLog [6],\nNeural LP. They estimate the graph structure via constructing TensorLog operators per relation using\na portion of the knowledge graph. Similar to us, they chain these operators to compute a score for\neach triplet, and learn rules by maximizing this score. As we explain in Section 4.1, this formulation\nis bounded to a \ufb01xed length of rules. To overcome this limitation, Neural LP uses an LSTM and\nattention mechanisms to learn variable rule lengths. However, it can be implied from Theorem 1 that\nits formulation has theoretical limitations on the rules it can produce.\nThere are some other interesting works [7, 14, 25, 30] which learn rules in a differentiable manner.\nHowever, they need to learn embeddings for each entity in the graph and they do link prediction\nnot only based on the learned rules but also the embeddings. 
Therefore we exclude them from our experiment section.

4 Methodology

To provide intuition about each part of our algorithm we start with a vanilla solution to the problem. We then explain the drawbacks of this approach and modify the suggested method step-by-step, which makes the challenges of the problem clearer and provides insight into the different parts of the suggested algorithm.

We begin by defining a one-to-one correspondence between the elements of E and {v1, ..., vn}, where n is the number of entities and v_i ∈ {0, 1}^n is a vector with 1 at position i and 0 elsewhere. We also define A_{B_r} as the adjacency matrix of the knowledge base with respect to relation B_r; the (i, j)-th element of A_{B_r} equals 1 when the entities corresponding to v_i and v_j have relation B_r, and 0 otherwise.

4.1 A Compact Differentiable Formulation

To approach this inherently discrete problem in a differentiable manner, we utilize the fact that, using the above notation, for a pair of entities (x, y) the existence of a chain of atoms such as

    B1(x, z1) ∧ B2(z1, z2) ∧ ··· ∧ BT(z_{T−1}, y)    (2)

is equivalent to v_x^T · A_{B1} · A_{B2} ··· A_{BT} · v_y being a positive scalar. This scalar is equal to the number of paths of length T connecting x to y which traverse relation B_{r_i} at step i. It is straightforward to show that for each head relation H, one can learn logical rules by finding an appropriate α that maximizes

    O_H(α) := Σ_{(x,H,y) ∈ KG} v_x^T ω_H(α) v_y,    (3)

    ω_H(α) := Σ_s α_s Π_{k ∈ p_s} A_{B_k},    (4)

where s indexes over all potential rules with maximum length T, and p_s is the ordered list of relations in the rule indexed by s.

However, since the number of learnable parameters in O_H(α) can be exceedingly large, i.e.
O(|R|^T), and the number of observed pairs (x, y) which satisfy the head H is usually small, direct optimization of O_H(α) falls in the regime of over-parameterization and cannot provide useful results. To reduce the number of parameters one can rewrite ω_H(α) as

    Ω_H(a) := Π_{i=1}^{T} Σ_{k=1}^{|R|} a_{i,k} A_{B_k}.    (5)

This reformulation significantly reduces the number of parameters to T|R|. However, the new formulation can only learn rules with fixed length T. To overcome this problem, we propose to modify Ω_H(a) to

    Ω^I_H(a) := Π_{i=1}^{T} ( Σ_{k=0}^{|R|} a_{i,k} A_{B_k} ),    (6)

where we define a new relation B0 with an identity adjacency matrix A_{B0} = I_n. With this change, the expansion of Ω^I_H includes all possible rule templates of length T or smaller with only T(|R| + 1) free parameters.

Although Ω^I_H considers all possible rule lengths, it is still constrained in learning the correct rule confidences. As we will show in the experiments (Section 5.3), this formulation (as well as Neural LP [41]) inevitably mines incorrect rules with high confidences. The following theorem provides insight about the restricted expressive power of the rules obtained by Ω^I_H.

Theorem 1. If R_o, R_s are two rules of length T obtained by optimizing the objective related to Ω^I_H, with confidence values α_o, α_s, then there exist ℓ rules of length T, R_1, R_2, ···, R_ℓ, with confidence values α_1, α_2, ···, α_ℓ, such that:

    d(R_o, R_1) = d(R_ℓ, R_s) = 1 and d(R_o, R_s) ≤ ℓ + 1,
    d(R_l, R_{l+1}) = 1 and α_l ≥ min(α_o, α_s) for 1 ≤ l ≤ ℓ,

where d(·, ·) is a distance between two rules of the same size, defined as the number of mismatched atoms in their bodies.

Proof.
The proof is provided in the supplementary file.

To further explain Theorem 1, consider an example knowledge base with only two meaningful logical rules of body length T = 3, i.e., R_o and R_s, such that they do not share any body atoms. According to Theorem 1, learning these two rules by optimizing the objective related to Ω^I_H(a) leads to learning at least ℓ ≥ 2 other rules, since d(R_o, R_s) = 3, with confidence values greater than min(α_o, α_s). This means we inevitably learn at least 2 additional incorrect rules with substantial confidence values.

Theorem 1 also entails other undesirable issues; for example, the resulting list of rules may not have the correct order of importance. More specifically, a rule might have a higher confidence value just because it shares an atom with another high confidence rule. Thus confidence values are not a direct indicator of rule importance. This reduces the interpretability of the output rules.

We must note that all previous differentiable rule mining methods based on Ω_H(a) suffer from this limitation. For example, Yang et al. [41] has this limitation for rules of maximum length. Section 5.3 illustrates these drawbacks using examples of mined rules.

4.2 DRUM

Recall that the number of confidence values for rules of length T or smaller is (|R| + 1)^T. These values can be viewed as entries of a T-dimensional tensor where the size of each axis is |R| + 1. To be more specific, we put the confidence value of the rule with body B_{r_1} ∧ B_{r_2} ∧ ··· ∧ B_{r_T} at position (r_1, r_2, ..., r_T) in the tensor, and we call it the confidence value tensor.

It can be shown that the final confidences obtained by expanding Ω^I_H(a) are a rank one estimation of the confidence value tensor. This interpretation makes the limitation of Ω^I_H(a) clearer and provides a natural connection to the tensor estimation literature.
Since a low-rank approximation (not just rank one) is a popular method for tensor approximation, we use it to generalize Ω^I_H(a). The Ω related to a rank L approximation can be formulated as

    Ω^L_H(a, L) := Σ_{j=1}^{L} { Π_{i=1}^{T} Σ_{k=0}^{|R|} a_{j,i,k} A_{B_k} }.    (7)

In the following theorem, we show that Ω^L_H(a, L) is powerful enough to learn any set of logical rules, without including unrelated ones.

Theorem 2. For any set of rules R_1, R_2, ··· R_r and their associated confidence values α_1, α_2, ···, α_r, there exist an L* and an a* such that:

    Ω^L_H(a*, L*) = α_1 R_1 + α_2 R_2 + ··· + α_r R_r.

Proof. To prove the theorem we show that one can find an a* for L* = r such that the requirements are met. Without loss of generality, assume R_j (for some 1 ≤ j ≤ r) is of length t_0 and consists of body atoms B_{r_1}, B_{r_2}, ···, B_{r_{t_0}}. By setting

    a*_{j,i,k} = α_j δ_{r_1}(k)   if i = 1,
    a*_{j,i,k} = δ_{r_i}(k)       if 1 < i ≤ t_0,
    a*_{j,i,k} = δ_0(k)           if t_0 < i,

it is easy to show that a* satisfies the condition in Theorem 2. Looking at Ω^L_H(a*, L*) for each j:

    Π_{i=1}^{T} Σ_{k=0}^{|R|} a*_{j,i,k} A_{B_k} = α_j A_{B_{r_1}} · A_{B_{r_2}} ··· A_{B_{r_{t_0}}} · I ··· I = α_j R_j.

Therefore Ω^L_H(a*, L*) = Σ_j α_j R_j.

Note that the number of learnable parameters in Ω^L_H is now LT(|R| + 1). However, this is just the number of free parameters for finding the rules for a single head relation; learning the rules for all relations in the knowledge graph requires estimating LT(|R| + 1) · |R| parameters, which is O(|R|²) and can be potentially large.
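The rank-L construction can be checked numerically. The following sketch (random toy adjacency matrices and coefficients, not a trained model) verifies that expanding the product-of-sums in Ω^L_H yields exactly one confidence Σ_j Π_i a_{j,i,k_i} per rule template, i.e., the entries of a rank-L tensor:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, R, T, L = 5, 3, 2, 2          # entities, |R| relations, rule length, rank

# Adjacency matrices; index 0 is the added identity relation B0.
A = np.concatenate([np.eye(n)[None], rng.integers(0, 2, (R, n, n))]).astype(float)
a = rng.random((L, T, R + 1))    # coefficients a_{j,i,k}

# Omega^L_H(a): sum over j of the product over steps i of sum_k a_{j,i,k} A_k.
omega = sum(np.linalg.multi_dot([sum(a[j, i, k] * A[k] for k in range(R + 1))
                                 for i in range(T)]) for j in range(L))

# Expanding the product gives one confidence per rule template (k_1,...,k_T):
# confidence = sum_j prod_i a_{j,i,k_i} -- an entry of a rank-L tensor.
expanded = sum(
    sum(np.prod([a[j, i, ks[i]] for i in range(T)]) for j in range(L))
    * np.linalg.multi_dot([A[ks[i]] for i in range(T)])
    for ks in product(range(R + 1), repeat=T))
assert np.allclose(omega, expanded)
```

The final assertion holds by the distributive law; the brute-force expansion enumerates all (|R| + 1)^T rule templates, which is exactly why the compact factorized form is needed in practice.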
Also, the main problem that we have not addressed yet is that direct optimization of the objective related to Ω^L_H learns the rule parameters for different head relations separately; therefore learning one rule cannot help in learning others.

Before we explain how RNNs can solve this problem, we draw attention to the fact that some pairs of relations cannot follow each other, or have a very low probability of appearing together. Consider the family knowledge base, where the entities are people and the relations are familial ties like fatherOf, auntOf, wifeOf, etc. If a node in the knowledge graph is the fatherOf another node, it cannot be the wifeOf a third node, because it must be male. Therefore the relation wifeOf never follows the relation fatherOf. This kind of information can be useful in estimating logical rules for different head relations and can be shared among them.

To incorporate this observation into our model and to alleviate the mentioned problems, we use L bidirectional RNNs to estimate a_{j,i,k} in equation (7):

    h^{(j)}_i, h'^{(j)}_{T−i+1} = BiRNN_j(e_H, h^{(j)}_{i−1}, h'^{(j)}_{T−i}),    (8)

    [a_{j,i,1}, ···, a_{j,i,|R|+1}] = f_θ([h^{(j)}_i, h'^{(j)}_{T−i+1}]),    (9)

where h and h' are the hidden states of the forward and backward path RNNs, respectively, both of which are zero initialized. The subscripts of the hidden states denote their time step, and their superscripts identify their bidirectional RNN. e_H is the embedding of the head relation H for which we want to learn a probabilistic logic rule, and f_θ is a fully connected neural network that generates the coefficients from the hidden states of the RNNs.

We use a bidirectional RNN instead of a normal RNN because it is capable of capturing information about both the backward and forward order in which the atoms can appear in the rule.
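Equations (8) and (9) can be sketched as follows. This is a minimal illustration with plain tanh RNNs and random (untrained) weights — the actual model uses LSTMs and a learned f_θ, and the softmax normalization of the coefficients is an assumption of this sketch — so only the tensor shapes and data flow carry over:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, R, L = 8, 3, 4, 2           # hidden size, rule length, |R|, rank

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

e_H = rng.standard_normal(d)      # embedding of the head relation H

a = np.zeros((L, T, R + 1))       # coefficients a_{j,i,k} of equation (7)
for j in range(L):                # one bidirectional RNN per rank component
    Wf, Wb = rng.standard_normal((2, d, 2 * d)) * 0.1
    Wo = rng.standard_normal((R + 1, 2 * d)) * 0.1
    h = np.zeros((T + 1, d))      # forward hidden states, zero initialized
    hb = np.zeros((T + 1, d))     # backward hidden states, zero initialized
    for i in range(1, T + 1):     # both paths consume e_H at every step
        h[i] = np.tanh(Wf @ np.concatenate([e_H, h[i - 1]]))
        hb[i] = np.tanh(Wb @ np.concatenate([e_H, hb[i - 1]]))
    for i in range(1, T + 1):     # f_theta: linear layer + softmax over relations
        a[j, i - 1] = softmax(Wo @ np.concatenate([h[i], hb[T - i + 1]]))

assert np.allclose(a.sum(axis=2), 1.0)   # coefficients at each step sum to 1
```

Because the recurrent weights are shared across head relations in the full model, the same machinery can encode ordering constraints (such as wifeOf never following fatherOf) once and reuse them for every head predicate.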
In addition, sharing the same set of recurrent networks for all head predicates (for all Ω^L_H) allows information to be shared from one head predicate to another.

5 Experiments

In this section we evaluate DRUM on statistical relation learning and knowledge base completion. We also empirically assess the quality and interpretability of the learned rules.

We implement our method in TensorFlow [1] and train on Tesla K40 GPUs. We use ADAM [19] with a learning rate of 0.001 and a batch size of 64. We set both the hidden state dimension and the head relation vector size to 128. We apply gradient clipping when training the RNNs and use LSTMs [17] for both directions. f_θ is a single-layer fully connected network. We follow the convention in the existing literature [41] of splitting the data into three categories: facts, train, and test. The code and the datasets for all the experiments will be publicly available.

5.1 Statistical Relation Learning

Datasets: Our experiments were conducted on three different datasets [20]. The Unified Medical Language System (UMLS) consists of biomedical concepts such as drug and disease names and relations between them such as diagnosis and treatment. Kinship contains kinship relationships among members of a Central Australian native tribe. The Family dataset contains the bloodline relationships between individuals of multiple families. Statistics about each dataset are shown in Table 1.

Table 1: Dataset statistics for statistical relation learning

Dataset   #Triplets   #Relations   #Entities
Family    28356       12           3007
UMLS      5960        46           135
Kinship   9587        25           104

We compared DRUM to its state-of-the-art differentiable rule mining alternative, Neural LP [41].
To show the importance of having a rank greater than one in DRUM, we test two versions, DRUM-1 and DRUM-4, with L = 1 and L = 4 (rank 4), respectively.

To the best of our knowledge, Neural LP and DRUM are the only scalable¹ and differentiable methods that provide reasoning on KBs without needing embeddings of the entities at test time, and that make predictions solely based on the logical rules. Other methods like NTPs [25, 30] and MINERVA [8] rely on some type of learned embeddings at training and test time. Since rules are interpretable and embeddings are not, this puts our method and Neural LP in the fully-interpretable category while the others do not have this advantage (therefore it is not fair to compare them directly with each other). Moreover, methods that rely on embeddings (fully or partially) are prone to worse results in inductive tasks, as partially shown in the experiments section. Nonetheless we show the results of the other methods in the appendix.

Table 2: Experiment results with maximum rule length 2 and 3

                       Family                  UMLS                    Kinship
                  MRR  @10  @3   @1       MRR  @10  @3   @1       MRR  @10  @3   @1
T = 2  Neural-LP  .91  .99  .96  .86      .75  .92  .86  .62      .62  .91  .69  .48
       DRUM-1     .92  1.0  .98  .86      .80  .97  .93  .66      .51  .85  .59  .34
       DRUM-4     .94  1.0  .99  .89      .81  .98  .94  .67      .60  .92  .69  .44
T = 3  Neural-LP  .88  .99  .95  .80      .72  .93  .84  .58      .61  .89  .68  .46
       DRUM-1     .91  .99  .96  .85      .77  .96  .92  .63      .57  .88  .66  .43
       DRUM-4     .95  .99  .98  .91      .80  .97  .92  .66      .61  .91  .71  .46

Table 2 shows link prediction results for each dataset in two scenarios, with maximum rule length two and three. The results demonstrate that DRUM empirically outperforms Neural-LP in both cases T = 2, 3. Moreover, they illustrate the importance of having a rank higher than one in estimating confidence values.
We can see a more than seven percent improvement on some metrics for UMLS, and meaningful improvements in all other datasets. We believe DRUM's performance over Neural LP is due to its higher-rank approximation of rule confidences and its use of bidirectional LSTMs to capture the forward and backward ordering criteria governing the body relations according to the ontology.

¹ E.g., on the Kinship dataset DRUM takes 1.2 minutes to run vs. more than 8 hours for NTP(-λ) [30] on the same machine.

5.2 Knowledge Graph Completion

We evaluate our proposed model on inductive and transductive link prediction tasks on two widely used knowledge graphs, WordNet [18, 24] and Freebase [3]. WordNet is a knowledge base constructed to produce an intuitively usable dictionary, and Freebase is a growing knowledge base of general facts. In the experiments we use WN18RR [10], a subset of WordNet, and FB15K-237 [36], which are more challenging versions of WN18 and FB15K [4], respectively. The statistics of these knowledge bases are summarized in Table 3. We also present our results on WN18 [4] in the appendix.

For transductive link prediction we compare DRUM to several state-of-the-art models, including DistMult [40], ComplEx [37], Gaifman [27], TransE [4], ConvE [10], and most importantly Neural-LP. Since NTP(-λ) [30] is not scalable to WN18 or FB15K, we could not present its results on the larger datasets. Also, dILP [14], unlike our method, requires negative examples, which are hard to obtain under the Open World Assumption (OWA) of modern KGs, and dILP is memory-expensive, as the authors admit, so it cannot scale to large KGs; thus we cannot compare numerical results here.

In this experiment for DRUM we set the rank of the estimator to L = 3 for both datasets. The results are reported without any hyperparameter tuning. To train the model, we split the training file into a facts file and a new training file with a ratio of three to one.
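The filtered ranking protocol of Bordes et al. [4], used throughout the evaluation below, can be sketched as follows (`filtered_metrics` and the toy scores are illustrative helpers, not the paper's code):

```python
import numpy as np

def filtered_metrics(scores, known, test_triples, ks=(1, 3, 10)):
    """Filtered MRR / Hits@k for tail prediction.

    scores[(s, r)] -> score vector over all candidate tail entities;
    known -- set of all true (s, r, o) triples, used to filter out
    other correct answers before ranking the test answer.
    """
    ranks = []
    for s, r, o in test_triples:
        sc = scores[(s, r)].copy()
        for o2 in range(len(sc)):            # mask competing true answers
            if o2 != o and (s, r, o2) in known:
                sc[o2] = -np.inf
        ranks.append(1 + int((sc > sc[o]).sum()))
    ranks = np.asarray(ranks, dtype=float)
    return {"MRR": (1 / ranks).mean(),
            **{f"Hits@{k}": (ranks <= k).mean() for k in ks}}

# Tiny illustration with made-up scores over 4 entities: entity 0 outscores
# the test answer 1, but it is itself a known true answer, so it is filtered.
scores = {(0, 0): np.array([0.9, 0.8, 0.1, 0.05])}
known = {(0, 0, 0), (0, 0, 1)}
print(filtered_metrics(scores, known, [(0, 0, 1)]))  # rank 1 after filtering
```

Without filtering, the test answer would be ranked second behind another correct triple; filtering removes that penalty, which is the standard convention these benchmark numbers assume.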
Following the evaluation method in Bordes et al. [4], we use filtered ranking; Table 4 summarizes our results.

Table 3: Dataset statistics for knowledge base completion.

            WN18RR    FB15K-237
#Train      86,835    272,155
#Valid      3,034     17,535
#Test       3,134     20,466
#Relation   11        237
#Entity     40,943    14,541

Table 4: Transductive link prediction results. The results are taken from [21, 41] and [35].

                       WN18RR                      FB15K-237
                  MRR   @10   @3    @1        MRR   @10   @3    @1
R-GCN [31]        –     –     –     –         .248  .417  .258  .153
DistMult [40]     .43   .49   .44   .39       .241  .419  .263  .155
ConvE [10]        .43   .52   .44   .40       .325  .501  .356  .237
ComplEx [37]      .44   .51   .46   .41       .247  .428  .275  .158
TuckER [2]        .470  .526  .482  .443      .358  .544  .394  .266
ComplEx-N3 [21]   .47   .54   –     –         .35   .54   –     –
RotatE [35]       .476  .571  .492  .428      .338  .533  .375  .241
Neural LP [41]    .435  .566  .434  .371      .24   .362  –     –
MINERVA [8]       .448  .513  .456  .413      .293  .456  .329  .217
Multi-Hop [23]    .472  .542  –     .437      .393  .544  –     .329
DRUM (T=2)        .435  .568  .435  .370      .250  .373  .271  .187
DRUM (T=3)        .486  .586  .513  .425      .343  .516  .378  .255

The results clearly show that DRUM empirically outperforms Neural-LP on all metrics on both datasets. DRUM also achieves state-of-the-art Hits@1, Hits@3, and MRR on WN18RR among all methods (including the embedding based ones).

It is important to note that comparing DRUM with embedding based methods solely on accuracy is not a fair comparison, because unlike DRUM they are black-box models that do not provide interpretability. Also, as we will demonstrate next, embedding based methods are not capable of reasoning on previously unseen entities.

For the inductive link prediction experiment, the sets of entities in the test and train files need to be disjoint.
To force that condition, after randomly selecting a subset of test tuples to be the new test file, we omit from the training file any tuples containing an entity that appears in the new test file. Table 5 summarizes the inductive results for Hits@10.

Table 5: Inductive link prediction Hits@10 metrics.

            WN18    FB15K-237
TransE      0.01    0.53
Neural LP   94.49   27.97
DRUM        95.21   29.13

Table 6: Human assessment of the number of consecutive correct rules (T = 2).

          Neural LP   DRUM
father    2           5
sister    3           10
uncle     6           6

It is reasonable to expect a significant drop in the performance of embedding based methods in the inductive setup. The results in Table 5 clearly show this for TransE. The table also demonstrates the superiority of DRUM over Neural LP in the inductive regime. Also, for Hits@1 and Hits@3, the results of DRUM are about 1 percent better than Neural LP, and for TransE all the values are very close to zero.

5.3 Quality and Interpretability of the Rules

As stated in Section 1, an important advantage of rules as a reasoning mechanism is their comprehensibility by humans. To evaluate the quality and interpretability of the rules mined by DRUM we perform two experiments. Throughout this section we use the family dataset for demonstration purposes as it is more tangible. Other datasets like UMLS yield similar results.

We use human annotation to quantitatively assess the rule quality of DRUM and Neural LP. For each system and each head predicate, we ask two blind annotators² to examine each system's sorted list of rules. The annotators were instructed to identify the first rule they perceive as erroneous.
Table 6 depicts the number of correct rules each system produces before generating an erroneous rule.

The results of this experiment demonstrate that the rules mined by DRUM are better sorted and are perceived to be more accurate.

We also sort the rules generated by each system by their assigned confidences and show the three top rules³ in Table 7. Logically incorrect rules are highlighted by italic red. This experiment shows that two of the three top-ranked rules generated by Neural LP are incorrect (for both head predicates wife and son).

These errors are inevitable because it can be shown that, for rules of maximum length T, the estimator of Neural LP provides a rank one estimation of the confidence value tensor described in Section 4.2. Thus, according to Theorem 1, the second highest confidence rule generated by Neural LP has to share a body atom with the first rule. For example the rule brother(B, A) ⇒ son(B, A), even though incorrect, has a high confidence due to sharing the body atom brother with the highest confidence rule (the first rule).
Since DRUM does not have this limitation, the same does not happen for the rules mined by DRUM.

Table 7: Top 3 rules obtained by each system, learned on the family dataset (← denotes implication from body to head).

Neural LP:
    wife(C, A) ← husband(A, B), husband(B, C)
    wife(B, A) ← husband(A, B)
    wife(C, A) ← daughter(B, A), husband(B, C)
    son(C, A) ← son(B, A), brother(C, B)
    son(B, A) ← brother(B, A)
    son(C, A) ← son(B, A), mother(B, C)
    brother(B, A) ← sister(A, B)
    brother(C, A) ← sister(A, B), sister(B, C)
    brother(C, A) ← brother(A, B), sister(B, C)

DRUM:
    wife(A, B) ← husband(B, A)
    wife(C, A) ← mother(A, B), father(C, B)
    wife(C, A) ← son(B, A), father(C, B)
    son(C, A) ← nephew(A, B), brother(B, C)
    son(C, A) ← brother(A, B), mother(C, B)
    son(C, A) ← brother(A, B), daughter(B, C)
    brother(C, A) ← nephew(A, B), uncle(B, C)
    brother(C, A) ← nephew(A, B), nephew(C, B)
    brother(C, A) ← brother(A, B), sister(B, C)

² Two CS students. The annotators are not aware which system produced the rules.
³ A complete list of the top 10 rules is available in the supplementary materials.

6 Conclusion

We present DRUM, a fully differentiable rule mining algorithm which can be used for inductive and interpretable link prediction. We provide intuition about each part of the algorithm and demonstrate its empirical success on a variety of tasks and benchmark datasets.

DRUM's objective function is based on the Open World Assumption of KBs and is trained using only positive examples. As possible future work we would like to modify DRUM to take advantage of negative sampling. Negative sampling has shown empirical success in representation learning methods and it may also be useful here.
DRUM's objective function is based on the Open World Assumption of KBs, and the model is trained using only positive examples. As possible future work, we would like to modify DRUM to take advantage of negative sampling, which has shown empirical success in representation learning methods and may also be useful here. Another direction for future work is to investigate an adequate way of combining differentiable rule mining with representation learning techniques.

Acknowledgments

We thank Kazem Shirani for his valuable feedback. We thank Anthony Colas and Sourav Dutta for their help in the human assessment of the rules. This work is partially supported by NSF under IIS Award #1526753 and DARPA under Award #FA8750-18-2-0014 (AIDA/GAIA).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] I. Balažević, C. Allen, and T. M. Hospedales. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.

[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.

[4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.

[5] Y. Chen, S. Goldberg, D. Z. Wang, and S. S. Johri. Ontological pathfinding: Mining first-order knowledge from large knowledge bases. In Proceedings of the 2016 International Conference on Management of Data, pages 835–846.
ACM, 2016.

[6] W. W. Cohen. TensorLog: A differentiable deductive database. arXiv preprint arXiv:1605.06523, 2016.

[7] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv preprint arXiv:1711.05851, 2017.

[8] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In International Conference on Learning Representations, 2018.

[9] L. De Raedt, A. Dries, I. Thon, G. Van den Broeck, and M. Verbeke. Inducing probabilistic relational rules from probabilistic examples. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[10] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[11] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–610. ACM, 2014.

[12] T. Ebisu and R. Ichise. TorusE: Knowledge graph embedding on a Lie group. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[13] J. Ellis, J. Getman, D. Fore, N. Kuster, Z. Song, A. Bies, and S. M. Strassel. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC, 2015.

[14] R. Evans and E. Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.

[15] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek.
Fast rule mining in ontological knowledge bases with AMIE+. The VLDB Journal—The International Journal on Very Large Data Bases, 24(6):707–730, 2015.

[16] K. Guu, J. Miller, and P. Liang. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094, 2015.

[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[18] A. Kilgarriff. WordNet: An electronic lexical database, 2000.

[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] S. Kok and P. Domingos. Statistical predicate invention. In Proceedings of the 24th International Conference on Machine Learning, pages 433–440. ACM, 2007.

[21] T. Lacroix, N. Usunier, and G. Obozinski. Canonical tensor decomposition for knowledge base completion. arXiv preprint arXiv:1806.07297, 2018.

[22] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.

[23] X. V. Lin, R. Socher, and C. Xiong. Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3243–3253, 2018.

[24] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[25] P. Minervini, M. Bosnjak, T. Rocktäschel, and S. Riedel. Towards neural theorem proving at scale. arXiv preprint arXiv:1807.08204, 2018.

[26] S. Muggleton. Inverse entailment and Progol. New Generation Computing, pages 245–286, 1995.

[27] M. Niepert. Discriminative Gaifman models. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3413–3421, USA, 2016.
Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157382.3157479.

[28] P. G. Omran, K. Wang, and Z. Wang. Scalable rule learning via learning representation. In IJCAI, pages 2149–2155, 2018.

[29] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, pages 239–266, 1990.

[30] T. Rocktäschel and S. Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800, 2017.

[31] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.

[32] P. Schüller and M. Benz. Best-effort inductive logic programming via fine-grained cost-based hypothesis generation. Machine Learning, 107(7):1141–1169, 2018.

[33] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.

[34] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.

[35] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkgEQnRqYQ.

[36] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.

[37] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In M. F. Balcan and K. Q.
Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2071–2080, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/trouillon16.html.

[38] D. Vrandečić and M. Krötzsch. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, 2014.

[39] W. Y. Wang, K. Mazaitis, and W. W. Cohen. Structure learning via parameter learning. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1199–1208. ACM, 2014.

[40] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575, 2015.

[41] F. Yang, Z. Yang, and W. W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pages 2319–2328, 2017.