Paper ID: | 1863 |
---|---|

Title: | Implicitly learning to reason in first-order logic |

Solid paper with interesting theoretical results. In a high level, the challenge the authors had to solve is: how to achieve or even state learning when there are so many sources of infinities, such as example models and elements in each model's underlying universe? Of course, the paper considers a specific instance of this general question that arises in the context of a restricted version of first-order logic and a particular class of models for the logic. But I found its formulation of the problem elegant and the solution nice. Specifically, I liked the idea of using the fact that formulas cannot distinguish certain non-explicitly named elements and cannot count in a sense more than certain syntactically determined numbers. Also, the use of partial model and random-masking process is quite clever.

This paper is generally well written and clear, albeit targeting readers with formal backgrounds. The quality of the paper seems high in terms of its formal claims. The proposed mechanism is remarkable simple, making this an attractive approach. I really like the idea behind not making learning explicit (as opposed to rule induction for example). I have three main concerns about this paper: - In general it is very close to Juba's 2012 work [1]. In that work, a propositional variant is proposed. The work here extends the approach. This is a sensible idea, and well executed as far as I can tell, but the delta is still relatively small and the change feels relatively mechanic. That's not a problem per se, but means that in terms of originality this paper isn't as convincing. The authors make it worse by using verbatim copies of some of the text in [1], such as the first paragraph of section 3. - It's not clear what the significance of this result is. The authors do go beyond propositional logic, but make little effort to motivate the specific fragment and tell us what we can now do we couldn't before. For example, could this be contrasted with explicit rule learning and could we empirically see when and how it improves over that. While I understand this a primarily theoretical contribution, I think for a venue like NeuRIPS the authors need to do a better job in positioning and motivating their work in the broader statistical relational learning context. - I found it difficult to get an intuition how "implicit learning" really operates. For example, if I see a bunch of examples where human(x) => mortal(x), but I don't have any such formula my background, how is this formula implicitly learnt? I think I understand the algorithm but I have no intuition how it achieves this. Good running examples could help. [1] https://arxiv.org/pdf/1209.0056.pdf (the authors do cite this paper and attribute it appropriately) Update: after author response and reviewer discussion I increased my score from 4 to 6. I do think this is an important and novel theoretical contribution. I would still encourage the authors to make paper more accessible in terms of exposition and contextualisation for the non PL/Logic folks in the community. I would also *strongly* encourage them to not copy and paste whole paragraphs from Juba 2012.

• I struggled to follow the paper, which in my view has several reasons - While I am familiar with statistical relational learning, I lack some of the background to digest the paper in full. The paper is not very self contained and many concepts are assumed to be known to the reader or only touched upon briefly. For example, what is a universal closure (L100), - The paper is very dense, not leaving much space for intuitive explanations or examples. My impression is that this work might better fit into a journal format. I could also imagine it would make sense to move most of the proofs to the appendix and use the space to add examples and more intuitive explanations. A good example of what I mean can be found in lines 214 to 221. The explanation there was really helpful and I would want to see more of these throughout the paper. - The notation, while I assume common in this subfield, clashes with symbols commonly used in machine learning (e.g. θ,Δ,α) - The paper is notation heavy. I understand it is a theoretical paper and space is limited, but I felt at various points the authors should have spent more time spelling things out (e.g. In Theorem 16 it would literally have been shorter to briefly restate what δ,γ,k,m are rather than referring to Theorem 13). • How does learning from entailment here is connected to learning from entailment in the inductive logic programming literature (e.g. Stephen Muggleton. Inductive logic programming. New Generation Comput., 8(4): 295–318, 1991.)? • Weaknesses of the proposed approach are not discussed. For instance, is the time complexity mentioned in Theorem 14 good or bad? To what extent does it have implications for practical application (e.g. Knowledge Bases with thousands, tens of thousands, hundreds of thousands of facts)? My main problem is gauging the significance of the presented work. I think this is in part me not able to understand the paper in detail, but also due to the fact that there is no empirical validation of the presented approach. Minor comments • L65: I would add neuro-symbolic learning to the unification of statistical and logical representation here. See for example - Cohen, William W. "Tensorlog: A differentiable deductive database." arXiv preprint arXiv:1605.06523 (2016). - Rocktäschel, Tim, and Sebastian Riedel. "End-to-end differentiable proving." Advances in Neural Information Processing Systems. 2017. - Evans, Richard, and Edward Grefenstette. "Learning explanatory rules from noisy data." Journal of Artificial Intelligence Research 61 (2018): 1-64. - Manhaeve, Robin, et al. "Deepproblog: Neural probabilistic logic programming." Advances in Neural Information Processing Systems. 2018. • L81: I am not familiar with the last logical connective. • L107: What does "maximal" mean here? • L111: What does "eθ" refer to? Applying substitutions θ to e? • L116: I am confused since as far as I understand z would be a rank and a name here? • L155-7: I think this is a good summary of the goal of the paper and should be stated earlier. • L175: Can you elaborate why summing εi is valid here? My intuition is that this only works if the clauses are independent. Maybe this is trivially the case, but it is not obvious to me. • L220: This seems to depend on the distribution of test queries though. Are you assuming the follow the same/similar distribution as test queries? UPDATE: After considering the rebuttal by the authors and discussions with the other reviewers I am happy to increase my score. That said, I strongly encourage the authors to make the paper more accessible and improve its clarity. I particularly believe more running examples and intuitive explanations would make the paper stronger.