{"title": "Implicitly learning to reason in first-order logic", "book": "Advances in Neural Information Processing Systems", "page_first": 3381, "page_last": 3391, "abstract": "We consider the problem of answering queries about formulas of first-order logic based on background knowledge partially represented explicitly as other formulas, and partially represented as examples independently drawn from a fixed probability distribution. PAC semantics, introduced by Valiant, is one rigorous, general proposal for learning to reason in formal languages: although weaker than classical entailment, it allows for a powerful model theoretic framework for answering queries while requiring minimal assumptions about the form of the distribution in question. To date, however, the most significant limitation of that approach, and more generally most machine learning approaches with robustness guarantees, is that the logical language is ultimately essentially propositional, with finitely many atoms. Indeed, the theoretical findings on the learning of relational theories in such generality have been resoundingly negative. This is despite the fact that first-order logic is widely argued to be most appropriate for representing human knowledge. \nIn this work, we present a new theoretical approach to robustly learning to reason in first-order logic, and consider universally quantified clauses over a countably infinite domain. Our results exploit symmetries exhibited by constants in the language, and generalize the notion of implicit learnability to show how queries can be computed against (implicitly) learned first-order background knowledge.", "full_text": "Implicitly Learning to Reason in First-Order Logic\n\nUniversity of Edinburgh & Alan Turing Institute\n\nWashington University in St. 
Louis\n\nBrendan Juba\n\nbjuba@wustl.edu\n\nVaishak Belle\n\nvaishak@ed.ac.uk\n\nAbstract\n\nWe consider the problem of answering queries about formulas of \ufb01rst-order logic\nbased on background knowledge partially represented explicitly as other formulas,\nand partially represented as examples independently drawn from a \ufb01xed probabil-\nity distribution. PAC semantics, introduced by Valiant, is one rigorous, general\nproposal for learning to reason in formal languages: although weaker than classical\nentailment, it allows for a powerful model theoretic framework for answering\nqueries while requiring minimal assumptions about the form of the distribution\nin question. To date, however, the most signi\ufb01cant limitation of that approach,\nand more generally most machine learning approaches with robustness guarantees,\nis that the logical language is ultimately essentially propositional, with \ufb01nitely\nmany atoms. Indeed, the theoretical \ufb01ndings on the learning of relational theories\nin such generality have been resoundingly negative. This is despite the fact that\n\ufb01rst-order logic is widely argued to be most appropriate for representing human\nknowledge. In this work, we present a new theoretical approach to robustly learning\nto reason in \ufb01rst-order logic, and consider universally quanti\ufb01ed clauses over a\ncountably in\ufb01nite domain. Our results exploit symmetries exhibited by constants in\nthe language, and generalize the notion of implicit learnability to show how queries\ncan be computed against (implicitly) learned \ufb01rst-order background knowledge.\n\n1\n\nIntroduction\n\nThe tension between deduction and induction is perhaps the most fundamental issue in areas such as\nphilosophy, cognition and arti\ufb01cial intelligence. 
The deduction camp concerns itself with questions about the expressiveness of formal languages for capturing knowledge about the world, together with proof systems for reasoning from such knowledge bases. The learning camp attempts to generalize from examples consisting of partial descriptions of the world. In an influential paper, Valiant [31] recognized that the challenge of learning should be integrated with deduction. In particular, he proposed a semantics to capture the quality possessed by the output of (probably approximately correct) PAC-learning algorithms when formulated in a logic. Although weaker than classical entailment, it allows for a powerful model theoretic framework for answering queries.

From the standpoint of learning an expressive logical knowledge base and reasoning with it, most PAC results are somewhat discouraging. For example, in agnostic learning [12], where one does not require examples (drawn from an arbitrary distribution) to be fully consistent with learned sentences, efficient algorithms for learning conjunctions would yield an efficient algorithm for PAC-learning DNF (also over arbitrary distributions), which current evidence suggests to be intractable [6]. Thus, it is not surprising that when it comes to first-order logic (FOL), very little work tackles the problem in a general manner. This is despite the fact that FOL is widely argued to be most appropriate for representing human knowledge (e.g., [23, 26, 18]). For example, [4] consider the problem of the learnability of description logics with equality constraints. While description logics are already restricted fragments of FOL in only allowing unary and some binary predicates, it is shown that such a fragment cannot be tractably learned, leading to the identification of syntactic restrictions for learning from positive examples alone.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Analogously, when it comes to the learning of logic programs\n[5], which in principle may admit in\ufb01nitely many terms, syntactic restrictions are also typical [7].\nIn this work, we present new results on learning to reason in FOL knowledge bases. In particular, we\nconsider the problem of answering queries about FOL formulas based on background knowledge\npartially represented explicitly as other formulas, and partially represented as examples independently\ndrawn from a \ufb01xed probability distribution. Our results are based on a surprising observation made in\n[11] about the advantages of eschewing the explicit construction of a hypothesis, leading to a paradigm\nof implicit learnability. Not only does it enable a form of agnostic learning while circumventing\nknown barriers, it also avoids the design of an often restrictive and arti\ufb01cial choice for representing\nhypotheses. (See, for example, [14], which is similar in spirit in allowing declarative background\nknowledge but only permits constant-width clauses.) In particular, implicit learning allows such\nlearning from partially observed examples, which is commonplace when knowledge bases and/or\nqueries address entities and relations not observed in the data used for learning.\nThat work was limited to the propositional setting, however. Here, we develop a \ufb01rst-order logical\ngeneralization. This requires us to generalize the notions of validity and entailment, and propose\nnew methods for recognizing true formulas under partial information, that capture what is implicitly\nlearned. Since reasoning in full FOL is undecidable we need to consider a fragment, but the fragment\nwe identify and are able to learn and reason with is expressive and powerful. Consider that standard\ndatabases correspond to a maximally consistent and \ufb01nite set of literals: every relevant atom is\nknown to be true and stored in the database, or known to be false, inferred by (say) negation as\nfailure. 
Our fragment corresponds to a consistent but infinite set of ground clauses, not necessarily maximal. To achieve the generalization, we revisit the PAC semantics and exploit symmetries exhibited by constants in the language. Moreover, the underlying language is general in the sense that no restrictions are placed on clause length, predicate arity, and other similar technical devices seen in PAC results. We hope the simplicity of the framework appeals to readers and that our results will renew interest in learnability for expressive languages with quantificational power.

We remark that our sole focus is on PAC-semantics approaches, but there are also other families of methods for unifying statistical and logical representations that fall under the banner of statistical relational learning (SRL) (e.g., [13]). SRL includes widely used formalisms such as Markov Logic Networks [28] and frameworks such as Inductive Logic Programming [27]. Learning strategies for SRL are an active area of research with numerous recent advances. For example, a family of recent works has adapted the techniques for training neural networks into the Inductive Logic Programming paradigm [3, 29, 8, 22]. Generally speaking, there are significant differences to PAC-semantics approaches, such as in terms of the learning regime, the notion of correctness, and the underlying algorithmic machinery. For example, Markov Logic Networks use approximate maximum-likelihood learning strategies to capture the distribution of the data, whereas in PAC formulations, one considers an arbitrary unknown distribution over the data and studies the question of which formulas are learnable while accounting for the number of examples that need to be sampled from that distribution. PAC-semantics is distinguished in being able to provide guarantees of generalization performance and polynomial-time complexity with the minimal assumption of i.i.d. training examples. 
Of course, there is much to be gained by attempting to integrate these communities; see, for example, [5]. These differences notwithstanding, the learning of logical theories is usually restricted to finite-domain first-order logic, and so it is essentially propositional; in that regard, our setting is significantly more challenging.

2 Logical Framework

Language: We let L be a first-order language with equality and relational symbols {P(x), . . . , Q(x1, . . . , xk), . . .}, variables {x, y, z, . . .}, and a countably infinite set of rigid designators or names, say, the set of natural numbers N, serving as the domain of discourse for quantification. Well-defined formulas are constructed using the logical connectives {¬, ∨, ∀, ∧, ∃, ⊃}, as usual. (⊃ denotes implication.) Together with equality, names essentially realize an infinitary version of the unique-name assumption.1

1Our language L is essentially equivalent to standard FOL together with a unique-name assumption for infinitely many constants [17, Definition 3]. In general, the unique-name assumption does not rule out capturing uncertainty about the identity of objects; see [9, 30], for example.

The set of (ground) atoms is obtained as:2 ATOMS = {P(a1, . . . , ak) | P is a predicate, ai ∈ N}. We sometimes refer to elements of ATOMS as propositions, and to ground formulas as propositional formulas. We will use p, q, e to denote atoms, and α, β, φ, ψ to denote ground formulas.

Semantics: An L-model M is a {0, 1} assignment to the elements of ATOMS. Using |= to denote satisfaction, the semantics for φ ∈ L is defined inductively as usual, but with equality as identity: M |= (a = b) iff a and b are the same name, and with quantification understood substitutionally over all names in N: M |= ∀xφ(x) iff M |= φ(a) for all a ∈ N. 
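To make the substitutional semantics concrete, here is a small sketch of ground-formula satisfaction in Python. The tuple encoding and all names here are our own illustration, not the paper's notation, and quantifiers over the infinite domain are of course not evaluable by enumeration:

```python
# Toy sketch: a fragment of an L-model, restricted to finitely many
# ground atoms. Atoms are (predicate, names) pairs; a model maps each
# atom to 0/1. This encoding is our own illustration, not the paper's.

def holds(model, formula):
    """Evaluate a ground formula against a (total) model.
    Formulas: ("atom", pred, args), ("eq", a, b), ("not", f),
    ("or", f, g), ("and", f, g)."""
    op = formula[0]
    if op == "atom":
        return model[(formula[1], formula[2])] == 1
    if op == "eq":  # equality is identity of names
        return formula[1] == formula[2]
    if op == "not":
        return not holds(model, formula[1])
    if op == "or":
        return holds(model, formula[1]) or holds(model, formula[2])
    if op == "and":
        return holds(model, formula[1]) and holds(model, formula[2])
    raise ValueError("unknown connective: %r" % op)

M = {("Grad", ("logan",)): 1, ("Prof", ("logan",)): 0}
print(holds(M, ("or", ("atom", "Grad", ("logan",)),
                      ("atom", "Prof", ("logan",)))))  # True
```

Equality as identity of names is what realizes the unique-name assumption in this sketch: two names are equal exactly when they are the same string.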
We say that \u03c6 is valid iff for every L-model\nM, M |= \u03c6. Let the set of all models be M.\nRepresentation: Like in standard FOL, reasoning over the full fragment of L is undecidable.\nInterestingly, owing to a \ufb01xed, albeit countably in\ufb01nite, domain of discourse, the compactness\nproperty that holds for classical \ufb01rst-order logic does not hold in general [17]. For example,\n{\u2203xP (x),\u00acP (1),\u00acP (2), . . .} is an unsatis\ufb01able theory for which every \ufb01nite subset is indeed\nsatis\ufb01able. However, as identi\ufb01ed in [1], and earlier in [15], the case of disjunctive knowledge is\nmore manageable. In particular, we will be interested in learning and reasoning with incomplete\nknowledge bases with disjunctive information [1]:\nDe\ufb01nition 1: An acceptable equality is of the form x = a, where x is any variable and a any name.\nLet e range over formulas built from acceptable equalities and connectives {\u00ac,\u2228,\u2227}. Let c range\nover quanti\ufb01er-free disjunctions of (possibly non-ground) atoms. Let \u2200\u03c6 mean the universal closure\nof \u03c6, i.e., with a universal quanti\ufb01er on each free variable of \u03c6. A formula of the form \u2200(e \u2283 c) is\ncalled a \u2200-clause. A knowledge base (KB) \u2206 is proper+ if it is a \ufb01nite non-empty set of \u2200-clauses.\nThe rank of \u2206 is the maximum number of variables mentioned in any \u2200-clause in \u2206.\nThis fragment is very expressive. Consider that standard databases correspond to a maximally\nconsistent and \ufb01nite set of literals, in the sense that every relevant atom is known to be true and stored\nin the database, or known to be false, inferred by (say) negation as failure. In contrast, proper+ KBs\ncorrespond to a consistent but in\ufb01nite set of ground clauses, not necessarily maximal in this way. 
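For concreteness, a proper+ KB of Definition 1 might be represented as follows. This is a sketch under our own encoding assumptions; `ForallClause` and `rank` are hypothetical helpers, not from the paper, and guards are kept as opaque strings purely for illustration:

```python
# Sketch: a proper+ KB (Definition 1) as a finite set of ∀-clauses
# ∀(e ⊃ c). The dataclass encoding and helper names are ours, not the
# paper's.
from dataclasses import dataclass

@dataclass(frozen=True)
class ForallClause:
    variables: tuple  # universally quantified variables of the clause
    guard: str        # e: a formula over acceptable equalities (x = a)
    clause: tuple     # c: a disjunction of (possibly non-ground) atoms

def rank(kb):
    """Rank of a KB: the maximum number of variables in any ∀-clause."""
    return max(len(cl.variables) for cl in kb)

# Example 3's KB: ∀x (Grad(x) ∨ Prof(x)) and ∀x (x ≠ charles ⊃ Grad(x)).
kb = [
    ForallClause(("x",), "true", (("Grad", ("x",)), ("Prof", ("x",)))),
    ForallClause(("x",), "x != charles", (("Grad", ("x",)),)),
]
print(rank(kb))  # 1
```

The rank is what controls how many extra names a finite grounding must consider, which is the role it plays in the grounding operations defined next.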
We also note that [19] shows how to represent a certain family of “local” action models for planning within the fragment of proper+ we consider, for which polynomial-time reasoning is possible.

Grounding: A ground theory is obtained from ∆ by substituting variables with names. Suppose θ denotes a substitution. We denote the result of applying θ to a formula φ by φθ. For any set of names C ⊆ N, we write θ ∈ C to mean that substitutions are only allowed wrt the names in C. Formally, we define:

• GND(∆) = {cθ | ∀(e ⊃ c) ∈ ∆, θ ∈ N and |= eθ};
• For z ≥ 0, GND(∆, z) = {cθ | ∀(e ⊃ c) ∈ ∆, |= eθ, θ ∈ Z}, where Z is the set of names mentioned in ∆ plus z (arbitrary) new ones;
• For C ⊆ N, GND(∆, C) = {cθ | ∀(e ⊃ c) ∈ ∆, |= eθ, θ ∈ Z}, where Z is the set of names mentioned in ∆ plus the names in C;
• GND−(∆) = GND(∆, z) where z is the rank of ∆.

Reasoning: Unfortunately, arbitrary reasoning with such KBs is also undecidable [15, Theorem 7]. Various proposals have appeared to address that problem: in [15], for example, a sound but incomplete evaluation-based semantics is studied. In [1], it is instead shown that when the query is limited to ground formulas, we can reduce first-order entailment to propositional satisfiability:

Theorem 2: [1] Suppose ∆ is a proper+ KB, and α is a ground formula. Then ∆ |= α iff GND−(∆ ∧ ¬α) is unsatisfiable.

Here, the RHS of the iff is a propositional formula, obtained by a finite grounding, as defined above.

Example 3: Suppose ∆ = {∀x(Grad(x) ∨ Prof(x)), ∀x(x ≠ charles ⊃ Grad(x))} and the query is Grad(logan). The query can be seen to be entailed. 
Given that the KB's rank is 1, consider the grounding of the KB and the negated query wrt {charles, logan, jean} (here jean is chosen arbitrarily). It is indeed unsatisfiable.

It is worth noting that the proof here (and in other proposals with L-like languages [18, 15, 21]) is established by setting up a bijection between names to show that all names other than those that appear in the finite grounding in the RHS behave “identically,” and so for entailment purposes, it suffices to consider a finite set consisting of the constants already mentioned and a few extra ones. That idea can be traced back to [17] (reformulated here for our purposes):

Theorem 4: [17] Suppose α = ∀xφ(x) is a ∀-clause. (Its rank is 1.) Let C be the names mentioned in GND(α, 1). Then for every a ∈ N, there is a b ∈ C such that |= φ(a) iff |= φ(b).

The essence of Theorem 2 is to exploit this idea to show (reformulated here for our purposes):

Lemma 5: [1] Suppose α is as above. If GND(α, 1) is satisfiable, then so is GND(α, z) for z ≥ 1.

Thus, because Theorem 4 establishes that GND(α, 1) is satisfiable if and only if α is satisfiable in the countably infinite domain, and Lemma 5 establishes that the introduction of extra names in GND(α, z) preserves satisfiability, we obtain satisfiability under the larger, common set of names used in GND−. These observations will now lead to an appealing account of implicit learnability with proper+ KBs.

2Because equality is treated separately, atoms and clauses do not include equalities.

3 Generalizing PAC-Semantics

We now recall the semantics we use, PAC semantics as introduced by Valiant [31]. PAC semantics was formulated to capture the quality possessed by the output of PAC-learning algorithms, when viewed as formulas in a logic. 
Because inductive generalization cannot be captured by deduction, it inherently requires that we admit the possibility of an incorrect generalization. Thus, as compared to classical (Tarskian) semantics, the PAC semantics is necessarily weaker. In the classical propositional formulation, we suppose a propositional language with (say) n propositions, yielding a model theoretic space {0, 1}^n. We suppose that we observe examples independently drawn from a distribution D over {0, 1}^n. Then, suppose further that these examples enable a learning algorithm to find a formula φ. We cannot expect this formula to be valid in the traditional sense, as PAC-learning does not guarantee that the rule holds for every possible binding, only that the φ so produced agrees with probability 1 − ε wrt future examples drawn from the same distribution. This motivates a weaker notion of validity:

Definition 6: Given a distribution D over {0, 1}^n, we say that a Boolean function F is (1 − ε)-valid if Pr_{x∈D}[F(x) = 1] ≥ 1 − ε. If ε = 0, we say F is perfectly valid.

Thus far, the PAC semantics and its application to the formalization of robust logic-based learning have been limited to the propositional setting [31, 24, 11], that is, where the learning vocabulary is finitely many atoms, and the background knowledge is essentially restricted to a propositional formula.3 Generalizing that to the FOL case has to address, among other things, what (1 − ε)-validity means, how FOL formulas could be learned by algorithms, and finally, how entailments can be computed. That is precisely our goal for this paper.

We start by proposing an extension of the PAC semantics to the infinitary structures (generalizing assignments) constructed for L, namely M. 
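Before turning to that extension, Definition 6 can be read operationally: (1 − ε)-validity can be estimated by sampling, as in the following sketch. The distribution D below (independent biased bits) and all function names are hypothetical, chosen purely so the example runs:

```python
# Sketch: estimating (1 - eps)-validity (Definition 6) by sampling.
# The distribution and names are our own illustration, not the paper's.
import random

def estimated_validity(f, sample, n_atoms=2, m=10000):
    """Fraction of m sampled assignments on which f evaluates to 1."""
    return sum(1 for _ in range(m) if f(sample(n_atoms))) / m

random.seed(0)

def sample(n):  # assumed D: each atom independently true w.p. 0.9
    return [1 if random.random() < 0.9 else 0 for _ in range(n)]

f = lambda x: x[0] or x[1]  # the formula p ∨ q
print(round(estimated_validity(f, sample), 2))  # close to 0.99
```

Here p ∨ q fails only when both atoms are false, an event of probability 0.01 under the assumed D, so the formula is roughly 0.99-valid; the estimate concentrates around that value by Hoeffding's inequality, which is exactly the tool used in Theorem 15 below.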
For this, we will need to consider distributions on M, which are defined as usual [2]: we take M to be the sample space (of elementary events), define a σ-algebra M to be a set of subsets of M, which represents a collection of (not necessarily elementary) events, and a function Pr : M → [0, 1], which is the probability measure. We are now ready to define (1 − ε)-validity as needed in the PAC semantics.

Definition 7: Given a distribution Pr over M, we say a formula φ ∈ L is (1 − ε)-valid iff Pr(⟦φ⟧) ≥ 1 − ε. If ε = 0, then we say that φ is perfectly valid. Here, ⟦φ⟧ for any closed formula φ ∈ L denotes the set {M ∈ M | M |= φ}.

In practice, the most important use of the notion of validity is to check the entailment of a formula from a knowledge base, and by extension, the reader may wonder how that carries over from classical validity. As also observed in [11] (for the propositional case), the union bound allows classical reasoning to have a natural analogue in the PAC semantics, shown below. Note that, as already mentioned, our assumption henceforth is that knowledge bases are proper+, and queries are ground formulas, both in the context of reasoning as well as learning.

3Valiant [31] uses a fragment of FOL for which propositionalization is guaranteed to yield a small propositional formula, and only considers such a reduction to the propositional case.

Proposition 8: Let ψ1, . . . , ψk be ∀-clauses such that each ψi is (1 − εi)-valid under a common distribution D for some εi ∈ [0, 1]. Suppose {ψ1, . . . , ψk} |= ϕ, for some ground formula ϕ. 
Then ϕ is (1 − ε′)-valid under D for ε′ = ∑i εi.

4 Partial Observability

The learning problem of interest here is to obtain knowledge about the distribution D, which, of course, is not revealed directly, but only in the form of a set of examples. The examples in question are models independently drawn from D, and we are then interested in knowing whether a query α is (1 − ε)-valid. Intuitively, background knowledge ∆ may additionally be provided, and so the examples correspond to additional knowledge that the agent learns. This additional knowledge is never materialized in the form of L-formulas, but is left implicit, as postulated first in [11].

When it comes to the examples themselves, however, we certainly cannot expect them to reveal the full nature of the world, and indeed, partial descriptions are commonplace in almost all applications [25]. In the case of L, moreover, providing a full description may even be impossible in finite time. All of this motivates the following:

Definition 9: A partial model N maps ATOMS to {1, 0, ∗}. We say N is consistent with an L-model M iff for all p ∈ ATOMS, if N[p] ≠ ∗ then N[p] = M[p]. Let N be the set of all partial models.

Essentially, our knowledge of D will be obtained from a set of partial models that are the examples.

Definition 10: A mask is a function θ that maps L-models to partial models, with the property that for any M ∈ M, θ(M) is consistent with M, and only a finite number of atoms are mapped to {0, 1}. A masking process Θ is a mask-valued random variable (i.e., a random function). We denote the distribution over partial models obtained by applying a masking process Θ to a distribution D over L-models by Θ(D).4

The definition of masking processes allows the hiding of entries to depend on the underlying example from D. 
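As a concrete illustration of Definitions 9 and 10, a simple mask might reveal only a fixed finite set of atoms. The atom encoding and the particular mask below are our own, purely illustrative:

```python
# Sketch of Definitions 9 and 10: a mask maps an L-model to a partial
# model over {1, 0, "*"}, hiding all but finitely many atoms. The
# encoding and the particular mask are our own illustration.

def observe_only(visible_atoms):
    """A mask revealing only a fixed finite set of atoms."""
    def mask(model):
        return {p: (v if p in visible_atoms else "*")
                for p, v in model.items()}
    return mask

def consistent(partial, model):
    """Definition 9: N agrees with M wherever N is not '*'."""
    return all(v == "*" or v == model[p] for p, v in partial.items())

M = {("Mutant", ("logan",)): 1, ("Mutant", ("scott",)): 1,
     ("Teammate", ("scott", "logan")): 1}
N = observe_only({("Teammate", ("scott", "logan"))})(M)
print(N[("Mutant", ("logan",))], consistent(N, M))  # * True
```

A masking process would additionally randomize over which mask is applied, possibly depending on the underlying model, which is what the measurability condition of footnote 4 accounts for.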
Moreover, as discussed in [11] (for the propositional case), reasoning in PAC semantics from complete examples is trivial, whereas the hiding of all entries by a masking process means that the problem reduces to classical entailment. So, we expect examples to be of a sort in between these extremes. In particular, for the sake of tractable learning, we must consider formulas that can be evaluated efficiently from the partial models with high probability. This leads to a notion of witnessing.

Definition 11: We define a propositional formula φ ∈ L to be witnessed to evaluate to true or false in a partial assignment N by induction as follows:

• an atom Q(c⃗) is witnessed to be true/false iff it is true/false respectively in N;
• ¬φ is witnessed true/false iff φ is witnessed false/true respectively;
• φ ∨ ψ is witnessed true iff either φ or ψ is, and it is witnessed false iff both φ and ψ are witnessed false;
• φ ∧ ψ is witnessed true iff both φ and ψ are witnessed true, and it is witnessed false iff either φ or ψ is witnessed false;
• φ ⊃ ψ is witnessed true iff either φ is witnessed false or ψ is witnessed true, and it is witnessed false iff both φ is witnessed true and ψ is witnessed false.

We define a ∀-clause ∀x⃗ φ(x⃗) to be witnessed true in a partial model N for the set of names C if for every binding of x⃗ to names c⃗ ∈ C, the resulting ground clause φ(c⃗) is witnessed true in N.

It is the witnessing of ∀-clauses that, in essence, enables the implicit learning of quantified generalizations. Let us see how that works. Intuitively, from examples φ(c⃗1), . . .
, one would like to generalize to ∀x⃗ φ(x⃗), the latter being a statement about infinitely many objects. But what criteria would justify this generalization, outside of (say) witnessing infinitely many instances? Our result shows that, surprisingly, it suffices to obtain finitely many examples, so as to witness φ(c⃗1), . . . , φ(c⃗k), and yield universally quantified sentences with high probability. This is possible because, via Theorem 2, all the names not mentioned in the KB and the query behave identically. Thus, provided we witness the grounding of φ for a sufficient but finite set of constants, we can treat the implicit KB as including ∀-clauses, as it yields the same judgments on our queries.

Putting it all together, formally, in any given learning epoch, let S be the class of queries we are interested in asking: that is, S is any finite set of ground formulas. Let C then be all the names mentioned in S and the KB, plus z extra new ones chosen arbitrarily, where z is at least the rank of the KB. If z equals the KB's rank, then the rank of the implicit KB matches that of the explicit KB; otherwise, it would be higher. So the definition says that the witnessing of ∀x⃗ φ(x⃗) happens when φ(c⃗) is witnessed for all c⃗ ∈ C. We think this notion is particularly powerful, as it makes no reference to bindings from the full set of names N (which is infinite), nor to the observation of negative instances.

4Note that since we assume that the resulting partial models are finite and thus countable, as long as the masking processes are measurable functions w.r.t. the joint probability measure, every event defined in terms of the partial models is a countable union of measurable events, and thus measurable. 
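The clause case of Definition 11 can be sketched as follows for ground clauses. The encoding is our own: literals are atom-sign pairs, and `None` marks "not witnessed either way":

```python
# Sketch of Definition 11 for ground clauses: a clause is witnessed
# true in a partial model N if some literal is observed true, and
# witnessed false if every literal is observed false.

def witnessed_clause(clause, N):
    """clause: list of (atom, sign) literals; N maps atoms to 1/0
    (unmentioned atoms are hidden, i.e. '*').
    Returns True, False, or None (not witnessed either way)."""
    vals = []
    for atom, sign in clause:
        v = N.get(atom, "*")
        vals.append(None if v == "*" else bool(v) == sign)
    if any(v is True for v in vals):
        return True
    if all(v is False for v in vals):
        return False
    return None

N = {("Teammate", ("scott", "logan")): 1}
# Mutant(scott) ⊃ Teammate(scott, logan), written as a clause:
clause = [(("Mutant", ("scott",)), False),
          (("Teammate", ("scott", "logan")), True)]
print(witnessed_clause(clause, N))  # True
```

Note that the clause is witnessed true even though the predicate Mutant never appears in the partial model, which is exactly the point made in the surrounding text.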
Note also that witnessing does not require observing all atoms: a clause is witnessed to evaluate to true if some literal appearing in it is true in the partial model. Thus, the ∀-clause witnessed may involve predicates not explicitly appearing in the partial model.

Example 12: Let ∆ be the KB

{∀x ≠ logan ⊃ Mutant(x), ∀x ≠ y ⊃ [Mutant(x) ∧ Teammate(x, y) ⊃ Mutant(y)]}.

Then the ∀-clause ∀x ≠ logan ⊃ [Mutant(x) ⊃ Teammate(x, logan)] is witnessed for a suitable set of names w.r.t. ∆ (with rank two) in any example that mentions at least two other names (in addition to logan) for which the substitution into Mutant(x) ⊃ Teammate(x, logan) is satisfied in the partial model. For instance, we may have the partial model {Teammate(scott,logan), Teammate(jean,logan)}, or the partial model {Teammate(ororo,logan), Teammate(kurt,logan)}.

Witnessed formulas correspond to the implicit KB. In order to capture the inferences that the implicit KB permits, we will use partial models to simplify complex formulas in the KB or query. To that end, we define:

Definition 13: Given a partial model N and a propositional formula φ, the restriction of φ under N, denoted φ|N, is recursively defined: if φ is an atom witnessed in N, then φ|N is the value that φ is witnessed to evaluate to under N; if φ is an atom not set by N, then φ|N = φ; if φ = ¬ψ, then φ|N = ¬(ψ|N); and if φ = α ∧ β, then φ|N = (α|N) ∧ (β|N). (And analogously for the Boolean connectives ∨ and ⊃.) 
For a partial model N and a set of propositional formulas F, we let F|N denote the set {φ|N : φ ∈ F}.

Notice that here we do not define restrictions for quantified formulas, such as those appearing in the KB: while that is possible, it is not needed, as we will be leveraging Theorem 2 for reasoning.

Example 14: Consider GND−(∆) for the KB ∆ of Example 12, using the set of names {scott, jean, logan}. Then the restriction of the grounding of our second rule under the partial model {Teammate(scott,logan), Teammate(jean,logan), Teammate(scott,jean)} is

Mutant(scott) ⊃ Mutant(logan), [Mutant(logan) ∧ Teammate(logan,scott) ⊃ Mutant(scott)],
Mutant(jean) ⊃ Mutant(logan), [Mutant(logan) ∧ Teammate(logan,jean) ⊃ Mutant(jean)],
Mutant(scott) ⊃ Mutant(jean), [Mutant(jean) ∧ Teammate(jean,scott) ⊃ Mutant(scott)].

Had the partial model also included Teammate(logan,scott), Teammate(logan,jean), and Teammate(jean,scott), we would have had the further simpler collection

Mutant(scott) ⊃ Mutant(logan), Mutant(logan) ⊃ Mutant(scott),
Mutant(jean) ⊃ Mutant(logan), Mutant(logan) ⊃ Mutant(jean),
Mutant(scott) ⊃ Mutant(jean), Mutant(jean) ⊃ Mutant(scott).

5 Implicit Learnability

The central motivation here is learning to reason in FOL, and as argued earlier, implicit learning circumvents the need for an explicit hypothesis, especially since hypothesis fitting is intractable unless one severely restricts the hypothesis space. So, learning is integrated tightly into the application using the knowledge extracted from data. Our definitions in the previous sections establish the grounds on which a first-order implicit KB can be learned from finitely many finite-size examples, but also the grounds for deciding propositional entailments of ∀-clauses specified explicitly, i.e., the background knowledge. 
(Of course, reasoning is not yet tractable, but simply decidable; we return to this point later.) Overall, the learning regime is presented in Algorithm 1, and its correctness is justified in Theorem 15.

Algorithm 1 Reasoning with implicit learning

Input: Partial models N(1), N(2), . . . , N(m), explicit KB ∆, query α (a ground formula), number of names k at least equal to ∆'s rank
Output: p̂ ∈ [0, 1] estimating that α is p̂-valid (see Theorem 15)
Initialize v ← 0
for i = 1, . . . , m do
  for all k-tuples of names (c1, . . . , ck) from N(i) not appearing in ∆ ∧ ¬α do
    if GND(∆ ∧ ¬α, {c1, . . . , ck})|N(i) is unsatisfiable then
      Increment v and skip to the next i.
    end if
  end for
end for
Return v/m

Theorem 15: Let δ, γ ∈ (0, 1) and k ∈ N be given. Suppose we have m partial models drawn i.i.d. from a common distribution D masked by a masking process Θ, where m ≥ (1/(2γ²)) ln(2/δ). (Here, ln denotes the natural logarithm.) With probability at least 1 − δ, Algorithm 1 returns a value p̂ s.t.

I. if ∆ ⊃ α is at most p-valid, then p̂ ≤ p + γ;
II. if there is a KB I such that
1. ∆ ∧ I |= α,
2. the rank of ∆ ∧ I is at most k, and
3. with probability at least p over partial models N ∈ Θ(D), there exist names c1, . . . , ck not appearing in ∆ or α such that every formula in I is witnessed true in N for c1, . . . , ck together with the names appearing in ∆ and α,
then p̂ ≥ p − γ.

Proof: Part I: p̂ ≤ p + γ if ∆ ⊃ α is at most p-valid. 
We first note that when GND(∆ ∧ ¬α, C)|N(i) |= ⊥ for any set of names C, since N(i) is consistent with the actual model M(i) that produced it, GND(∆ ∧ ¬α, C)|M(i) |= ⊥ as well. Thus, in this case, GND(∆ ∧ ¬α, C) is falsified by M(i). Since |C| is at least the rank of ∆, it is easy to see that GND(∆ ∧ ¬α), which is logically equivalent to ∆ ∧ ¬α, is falsifiable at M(i). So, it must be that the negation of that theory (i.e., ∆ ⊃ α) is satisfied at M(i).
Now, ∆ ⊃ α is by definition p-valid with respect to this distribution on M(i) if the probability that ∆ ⊃ α is satisfied by each M(i) is p. Moreover, it follows immediately from Hoeffding's inequality that for m ≥ (1/(2γ²)) ln(2/δ), the probability that the fraction of times ∆ ⊃ α is satisfied by M(i) (out of m) exceeds p by more than γ is at most δ/2. Thus, p̂, which is at most the fraction of times ∆ ⊃ α is actually satisfied by M(i), likewise is at most p + γ with probability at least 1 − δ/2.
Part II: rate of witnessing an implicit KB lower bounds p̂. Note that by the grounding trick (Theorem 2), ∆ ∧ I |= α implies that for any set of names c1, . . . , ck not appearing in ∆ or α, GND(∆ ∧ I ∧ ¬α, {c1, . . . , ck}) |= ⊥. Suppose that I is witnessed true for c1, . . . , ck together with the names in ∆ and α in N(i). We note that in the restricted formula GND(∆ ∧ I ∧ ¬α, {c1, . . . , ck})|N(i), the groundings of formulas in I all simplify to 1 (true), and so GND(∆ ∧ I ∧ ¬α, {c1, . . . , ck})|N(i) = GND(∆ ∧ ¬α, {c1, . . . , ck})|N(i). Thus, GND(∆ ∧ ¬α, {c1, . . . , ck})|N(i) |= ⊥, so v is incremented on this iteration. Thus, indeed, p̂ = v/m is at least the fraction of times out of m that I is witnessed true for some set of k names. It again follows from Hoeffding's inequality that for m ≥ (1/(2γ²)) ln(2/δ), this is at least p − γ with probability 1 − δ/2.
By a union bound, the two parts hold simultaneously with probability at least 1 − δ, as needed.

In essence, the no-overestimation condition is a soundness guarantee and the no-underestimation condition is a limited completeness guarantee: in other words, if the query logically follows from the explicit KB and examples, then the algorithm returns success with an appropriate p̂, and vice versa. Note that the number of examples m needed (to answer a single query) depends only on the desired accuracy γ and confidence δ. It is independent of the size of the KB, the number of predicates, etc.

Example 16: Continuing Examples 12 and 14, we noted that the ∀-clause

∀x (x ≠ logan ⊃ [Mutant(x) ⊃ Teammate(x,logan)])

was witnessed w.r.t. ∆ for partial models such as {Teammate(scott,logan), Teammate(jean,logan)} or {Teammate(ororo,logan), Teammate(kurt,logan)}. This formula could serve as an implicit KB if Θ(D) produces such examples; it completes a proof of Mutant(logan) by first inferring Mutant(x) for some x ≠ logan from the first rule of ∆, using this implicit KB formula to infer Teammate(x,logan), and finally using the second rule of ∆ to infer Mutant(logan).
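As a sanity check, the final steps of this chain are pure unit propagation over the restricted grounding. A small sketch (our own rendering, with illustrative atom strings; it assumes the restricted grounding has already produced the unit fact Mutant(scott) and the residual clause Mutant(scott) ⊃ Mutant(logan)):

```python
# Toy unit propagation over CNF clauses represented as frozensets of
# (atom, polarity) literals.

def unit_propagate(clauses):
    """Close a set of ground clauses under unit propagation and return
    the set of unit literals derived (including the given units)."""
    clauses = set(clauses)
    units = {next(iter(c)) for c in clauses if len(c) == 1}
    changed = True
    while changed:
        changed = False
        for c in clauses:
            # drop literals whose complements are known unit facts
            residue = frozenset(l for l in c if (l[0], not l[1]) not in units)
            if len(residue) == 1 and next(iter(residue)) not in units:
                units.add(next(iter(residue)))
                changed = True
    return units

grounding = [
    frozenset({("Mutant(scott)", True)}),                            # derived fact
    frozenset({("Mutant(scott)", False), ("Mutant(logan)", True)}),  # residual clause
]
# Propagation derives the query Mutant(logan).
derived = unit_propagate(grounding)
```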
In these partial\nmodels, respectively, the restricted grounding of \u2206 correspondingly produces Mutant(scott) and\nMutant(scott) \u2283 Mutant(logan), or Mutant(ororo) and Mutant(ororo) \u2283 Mutant(logan), which in\neach case allows us to prove the query Mutant(logan), via a different individual depending on the\nnames mentioned in the partial example. Observe that \u2206 does not allow us to infer that the Teammate\nrelation holds for any individuals, whereas the data alone, which only gives positive examples of\nthe Teammate relation, is not adequate to infer the Mutant relation. We need both to establish\nMutant(logan).\n\n6 Tractable Reasoning\n\nAlgorithm 1 reduces reasoning with implicit learning to deciding entailment. In order to obtain a\ntractable algorithm, we generally need to restrict the reasoning task somehow. One approach, taken in\nthe previous work on propositional implicit learning [11], is to \u201cpromise\u201d that the query is provable in\nsome low-complexity fragment; for example, it is provable by a small treelike resolution proof (where\n\u201csmall\u201d refers to the number of lines of the proof). Equivalently, we give up on completeness, and\nonly seek completeness with respect to conclusions provable in low complexity in a given fragment.\nIn general, then, one obtains a running time guarantee that is parameterized by the size of the proof\nof the query. We can take a similar approach here, by using an algorithm for deciding entailment that\nis ef\ufb01cient when parameterized in such terms. In general, what is needed is a fragment for which\nwe can decide the existence of proofs ef\ufb01ciently, and that is \u201crestriction-closed,\u201d meaning that for\nany partial model N, if we consider the restriction of each line of the proof, we obtain a proof in the\nsame fragment. Most fragments we might consider, including speci\ufb01cally treelike or bounded-width\nresolution, are restriction-closed. 
(See [10] for details.)
We will motivate an entirely new strategy here, which offers a semantic perspective to the proof-theoretic view in [11]. One classically sound model-theoretic approach to constraining propositional reasoning is to limit the power of the reasoner, as represented, for example, by the work on tautological entailment [16]. More recently, [20] suggest a simple evaluation scheme for proper+ KBs that gradually increases the power of the reasoner: level 0 is standard database lookup together with unit propagation, level 1 allows for one case split in a clause, level 2 allows two case splits, and so on. The formal intuition is as follows: suppose s is a set of ground clauses and φ is a ground query, and let us say it is a clause for simplicity. Let U(s) denote the closure of s under unit propagation, defined as the least set s′ satisfying: (a) s ⊆ s′ and (b) if literal l ∈ s′ and (¬l ∨ c) ∈ s′ then c ∈ s′. Then let V(s) define all possible weakenings: {c | c is a ground clause and there is a c′ ∈ U(s) s.t. c′ ⊆ c}. Then we define s |=z φ (read: "entails at level z") iff one of the following holds:

• subsume: z = 0, and φ ∈ V(s);
• split: z > 0 and there is some clause c ∈ s such that for all literals l ∈ c, s ∪ {l} |=(z−1) φ.

For small values of z, entailment at level z is tractable to decide as well as sound:
Theorem 17: [20] Suppose ∆, φ are propositional formulas and z ∈ N. Then, determining if ∆ |=z φ can be done in time O((|φ||∆|)^(z+1)). Moreover, if ∆ |=z φ then ∆ |= φ.

We will now see how to leverage these results. First, however, we need the equivalent to restriction-closed, as discussed above.
Proposition 18: Suppose φ, ∆, z are as above.
Then if ∆ |=z φ, and N is any partial model, then (∆|N) |=z (φ|N).
Basically, if φ is entailed at level z from ∆, then any restriction of φ under N must also be entailed by ∆ restricted to N, at least at level z if not lower. Notice that restricting a ground formula is equivalent (w.r.t. satisfiability) to simply conjoining the literals true at N with that formula, from which the proof follows. Now, recall from Theorem 2 that, given a proper+ KB ∆ and ground query α, we have ∆ |= α iff GND−(∆ ∧ ¬α) is unsatisfiable. Here, since α is already ground, we really only need to make sure that ∆ is ground w.r.t. all the names in ∆ ∧ ¬α and k new ones, k being the rank of ∆. So let GNDα(∆) denote precisely such a grounding of ∆. It then follows that GNDα(∆) |= α iff ∆ |= α. It is easy to show that the same holds for |=z as well [20]. So let Algorithm 1′ be exactly like Algorithm 1 except that it takes an additional parameter z (for limited reasoning) and replaces the check

GND(∆ ∧ ¬α, {c1, . . . , ck})|N(i) is unsatisfiable

with

GND(∆, {c1, . . . , ck, d1, . . . , dm})|N(i) |=z (α|N(i)), where {d1, . . . , dm} is the set of names appearing in α but not in ∆.

Theorem 19: Let δ, γ ∈ (0, 1), k ∈ N, m ≥ (1/(2γ²)) ln(2/δ), and let z ∈ N. Then with probability at least 1 − δ, Algorithm 1′ returns a value p̂ such that (I) and (II) are as in Theorem 15, except for (II.1), which states that ∆ ∧ I |=z α. The algorithm runs in time O((|α||GNDα(∆)|)^(z+1) m).
Discussion. Interestingly, in [21], it is shown that reasoning is also tractable in the first-order case if the knowledge base and the query both use a bounded number of variables.
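Returning to the definition of |=z above, it can be rendered directly (if naively) in code. This is a sketch of the definitions only, not the evaluation scheme of [20] as actually engineered; for simplicity it checks subsumption at every level rather than only at z = 0, which is sound since weakenings of U(s) remain derivable after literals are added.

```python
# Naive level-z entailment: clauses are frozensets of (atom, polarity)
# literals; s is a set of such clauses.

def unit_closure(s):
    """U(s): close s under unit propagation, i.e. if the unit clause {l}
    and a clause containing the complement of l are both present, add
    that clause with the complement removed."""
    s = set(s)
    changed = True
    while changed:
        changed = False
        units = [next(iter(c)) for c in s if len(c) == 1]
        for atom, pol in units:
            for c in list(s):
                if (atom, not pol) in c:
                    residue = c - {(atom, not pol)}
                    if residue not in s:
                        s.add(residue)
                        changed = True
    return s

def entails_at_level(s, query, z):
    """Decide s |=z query for a ground clause query."""
    # subsume: query is a weakening of some clause in U(s), i.e. query in V(s)
    if any(c <= query for c in unit_closure(s)):
        return True
    # split: some clause of s such that adding any one of its literals
    # as a unit entails the query at level z - 1
    if z > 0:
        for c in s:
            if all(entails_at_level(set(s) | {frozenset({l})}, query, z - 1)
                   for l in c):
                return True
    return False
```

For example, a KB containing A ∨ B, A ⊃ Q, and B ⊃ Q does not yield the query Q at level 0, but does at level 1 via a single case split on A ∨ B.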
Tractability for bounded-variable KBs and queries would then mean that we are no longer limited to ground queries and can handle queries with quantifiers. This direction is left for future research. Nonetheless, we note that deciding quantified (as opposed to ground) queries appears to demand more from learning. In general, in an infinite domain, we cannot hope to observe in a finite partial model that universally quantified formulas are ever true. Thus, we anticipate that extensions that handle queries with quantifiers will need a substantially different framework, presumably with stronger assumptions. One possible framework takes a more credulous approach to the learning problem (in contrast to our skeptical approach based on witnessing truth): we suppose that when a formula is frequently false on the distribution of examples, we also frequently obtain a partial model that witnesses the formula false, e.g., a partial model in which a binding of a candidate ∀-clause falsifies it. This is undoubtedly an assumption about the benevolent nature of the environment, captured as the notion of concealment in [25], but it does make learning conceptually simpler. In this framework, one permits all conclusions that are not explicitly falsified. Whether such an idea can be used for inductive generalization of FOL formulas over arbitrary distributions remains to be seen.

7 Conclusions

In this work, we presented new results on the problem of answering queries about formulas of first-order logic (FOL) based on background knowledge partially represented explicitly as other formulas, and partially represented as examples independently drawn from a fixed probability distribution. By appealing to the paradigm of implicit learnability, we sidestepped many major negative results, leading to a learning regime that works with a general and expressive FOL fragment.
No restrictions were imposed on clause length, predicate arity, and other similar technical devices seen in PAC results. Overall, we hope the simplicity of the framework is appealing to readers, and we hope our results will renew interest in learnability for expressive languages with quantificational power.

Acknowledgements

V. Belle was supported by a Royal Society University Research Fellowship. B. Juba was supported by NSF Award CCF-1718380. This work was partially performed while B. Juba was visiting the Simons Institute for the Theory of Computing. We thank our reviewers for their helpful suggestions.

References

[1] V. Belle. Open-universe weighted model counting. In AAAI, pages 3701–3708, 2017.
[2] P. Billingsley. Probability and Measure. Wiley-Interscience, New York, NY, USA, 3rd edition, 1995.
[3] W. W. Cohen. TensorLog: A differentiable deductive database. Preprint, arXiv:1605.06523, 2016.
[4] W. W. Cohen and H. Hirsh. The learnability of description logics with equality constraints. Machine Learning, 17(2-3):169–199, 1994.
[5] W. W. Cohen and C. D. Page. Polynomial learnability and inductive logic programming: Methods and results. New Generation Computing, 13(3-4):369–409, 1995.
[6] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning DNF's. In COLT, pages 815–830, 2016.
[7] L. De Raedt and S. Džeroski. First-order jk-clausal theories are PAC-learnable. Artificial Intelligence, 70(1):375–392, 1994.
[8] R. Evans and E. Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
[9] G. D. Giacomo, Y. Lespérance, and H. J. Levesque. Efficient reasoning in proper knowledge bases with unknown individuals. In IJCAI, pages 827–832, 2011.
[10] B. Juba. Learning implicitly in reasoning in PAC-semantics. Preprint, arXiv:1209.0056, 2012.
[11] B. Juba. Implicit learning of common sense for reasoning. In IJCAI, pages 939–946, 2013.
[12] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.
[13] K. Kersting, S. Natarajan, and D. Poole. Statistical relational AI: Logic, probability and computation. In International Conference on Logic Programming and Nonmonotonic Reasoning, pages 1–9, 2011.
[14] R. Khardon and D. Roth. Learning to reason with a restricted view. Machine Learning, 35(2):95–116, 1999.
[15] G. Lakemeyer and H. J. Levesque. Evaluation-based reasoning with disjunctive information in first-order knowledge bases. In KR, pages 73–81, 2002.
[16] H. Levesque. A logic of implicit and explicit belief. In AAAI, pages 198–202, 1984.
[17] H. J. Levesque. A completeness result for reasoning with incomplete first-order knowledge bases. In KR, pages 14–23, 1998.
[18] H. J. Levesque and G. Lakemeyer. The Logic of Knowledge Bases. The MIT Press, Cambridge, MA, USA, 2001.
[19] Y. Liu and G. Lakemeyer. On first-order definability and computability of progression for local-effect actions and beyond. In IJCAI, pages 860–866, 2009.
[20] Y. Liu, G. Lakemeyer, and H. J. Levesque. A logic of limited belief for reasoning with disjunctive information. In KR, pages 587–597, 2004.
[21] Y. Liu and H. J. Levesque. Tractable reasoning in first-order knowledge bases with disjunctive information. In AAAI, pages 639–644, 2005.
[22] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt. DeepProbLog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems 31, pages 3749–3759, 2018.
[23] J. McCarthy and P. J. Hayes. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence, pages 463–502, 1969.
[24] L. Michael. Reading between the lines. In IJCAI, pages 1525–1530, 2009.
[25] L. Michael. Partial observability and learnability. Artificial Intelligence, 174(11):639–669, 2010.
[26] R. C. Moore. The role of logic in knowledge representation and commonsense reasoning. In AAAI, pages 428–433, 1982.
[27] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629–679, 1994.
[28] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
[29] T. Rocktäschel and S. Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems 30, pages 3788–3800, 2017.
[30] S. Srivastava, S. J. Russell, P. Ruan, and X. Cheng. First-order open-universe POMDPs. In UAI, pages 742–751, 2014.
[31] L. G. Valiant. Robust logics. Artificial Intelligence, 117(2):231–253, 2000.