Review for NeurIPS paper: Learning to Prove Theorems by Learning to Generate Theorems

NeurIPS 2020

Learning to Prove Theorems by Learning to Generate Theorems

Review 1

Summary and Contributions: In 2016 Whalen announced the first automated theorem prover using deep reinforcement learning techniques. The prover was based on Metamath, a math formalization corpus and a proof checker. The current work extends Whalen's work by augmentation with synthetic data, which are generated also using neural networks (a combination of deterministic methods, GAN and reinforcement learning).

Strengths: The content of this work is relevant to NIPS. Data augmentation is also a natural and proven methodology to improve machine learning performance, though it is also known that with fixed amount of source data the gain from data augmentation is marginal. Based on a reimplementation of Whalen's code so as to adapt to GPU, the authors have shown some improvements with their data augmentation approach as expected.

Weaknesses: There is not that much novelty in the paper. It would be nice if the authors can give more discussion on the limitations of their approach. And I am wondering whether at certain point of augmentation they have witnessed any degradation of performance due to overfitting. The part on Metamath not clearly written. Being not an expert on Metamath myself, I had to go back to Whalen's paper from time to time (unfortunately it is also not very clearly written) and eventually to Metamath's manual. It took me some time to realize that only the fact of tree interpretation of Metamath sentences is essential to understanding this line of works. I also think it's a bit overclaiming by saying that your work is 'orthogonal' to all previous approaches as your idea verification methodology follows largely from previous approaches, especially Whalen, Loos, etc.

Correctness: Some of the results are overstated, as the paper is not that novel, but other than this there are no complaints about the method and correctness.

Clarity: Some parts could be explained better, as discussed above.

Relation to Prior Work: It is ok.

Reproducibility: Yes

Additional Feedback: 57 -> ``MetaGen'' (correct quotes). 71 -> Reference of HOL-Light (Harrison et al.) 92 -> ...self-play, 'with' one agent... 98 -> 'compete' with each other... 184 -> 'optionally' their proof trees... 322 -> 'MetaGen-Rand' (typo) 324 -> ...due to 'the fact that' our prover runs... 334 -> ...'focus more on'... 341 -> and less 'comfortable' / 'confident' with...

Review 2

Summary and Contributions: The paper describes a 7.7% improvement of the Holophrasm system by generating synthetic theorems. The theorems are generated by guided instantiation and special care is taken not to generate instantiations that are trivially true.

Strengths: Generating relevant synthetic theorems is a nice and nontrivial research direction. It is related to tasks such as lemma introduction and theory exploration. 7.7% more proved theorems on the test set is a decent result. The authors have done a decent amount of work.

Weaknesses: On the other hand, much larger improvements have been achieved by iterating learning and reasoning and thus "synthesizing" the body of training proofs. Such proofs are also created "synthetically" in some sense - as a result of diversely parameterized proof attempts on a curriculum of relevant theorems. This goes back at least to MaLARea [10,11,12] and continues in systems like ATPBoost [13]. See also, e.g., [1,2], where the improvements are 40% and 70% also on the internal guidance tasks. Hence the most straightforward iterative approach to producing more proof data seems far from saturated at the moment. Even in the tactical setting, already the first system (TacticToe) [14] introduces the "self-learning" option of (RL-style) training on its own (i.e. "synthesized") proofs (instead or together with the human proofs). Quite a few other arguments in the paper are incorrect or exaggarated. Already the first version of MPTP and the AI/TP experiments based on it [3] allowed and announced straightforward generation of 630000 related proof tasks from Mizar. Further millions of "synthetic" tasks can be created easily by chasing the large derivation graphs of MML and other ITP libraries. This has been to some extent used later e.g. in works such as [4,5], where the benefit of learning from additional proof tasks was also demonstrated. The same holds about theorem transfer [6], hint-based guidance [7], explicit weakening in Octopus [8,9], etc. In this context, the results are not surprising, and perhaps instead a bit disappointing compared to the large improvements obtained e.g. in [1,2,11,12,13] by using the standard learning/proving loop that generates more and more theorem proving data to learn from. Conjectures have also been synthesized in many other ways, some of them neural [15-22]. Another claim that should be (much) softened is that Holophrasm is "state-of-the-art". It may be the only ATP-style experiment done so far for MetaMath, but the (pretty much universal) experience on more standard ITPs is that the performance of hammers (combined with internal guidance of ATPs, etc.) and of learning-guided tactical systems is today 40-60%. The Metamath theorems are not harder than those in MML, Flyspeck and Isabelle, so a 25-30% performance is most likely improvable by very standard methods. There may be various reasons, e.g., the forward proof style, no pruning by unification (compare with the unification/indexing-based pruning in standard ATPs, etc.) used in Holophrasm. All that said, this seems to be one of the more interesting AI/TP papers submitted to NIPS'20, so I am leaning towards acceptance, provided the incorrect claims are fixed in the final version. ++++++++++++++++++++++++ Update after reading the response and other reviews: After reading the feedback and other reviews, I tend to agree with the rest of the reviewers that the score is still below 6 and the paper may need more work. Holophrasm being state-of-art may be indeed an exaggerated claim. There is no response to the "synthesis" presented here being similar to the one in Octopus. The argument that the synthesis differs from graph chasing and conjecturing is not too convincing either. Analogy-based conjecturing often creates weakenings/strengthenings. Graph-chasing of dependencies and premise selection do it too - just by (not) including some assumptions. I am further reminded of another related work [23], which pushes this quite a bit further. [23] Julio César López-Hernández, Konstantin Korovin: An Abstraction-Refinement Framework for Reasoning with Large Theories. IJCAR 2018: 663-679 ++++++++++++++++++++++ [1] Kaliszyk, C., Urban, J., Michalewski, H. & Olsák, M. Reinforcement Learning of Theorem Proving in NeurIPS 2018 (2018), 8836–8847. [2] Jan Jakubuv, Josef Urban: Hammering Mizar by Learning Clause Guidance. ITP 2019: 34:1-34:8 [3] Urban, J. MPTP - Motivation, Implementation, First Experiments. J. Autom. Reasoning 33, 319– 339 (2004). [4] Cezary Kaliszyk, Josef Urban: Learning-assisted theorem proving with millions of lemmas. J. Symb. Comput. 69: 109-128 (2015) [5] Kaliszyk, C., Urban, J. & Vyskocil, J. Lemmatization for Stronger Reasoning in Large Theories in FroCoS 2015 9322 (Springer, 2015), 341–356. [6] Thibault Gauthier, Cezary Kaliszyk: Sharing HOL4 and HOL Light Proof Knowledge. LPAR 2015: 372-386 [7] Robert Veroff: Using Hints to Increase the Effectiveness of an Automated Reasoning Program: Case Studies. J. Autom. Reasoning 16(3): 223-239 (1996) [8] Monty Newborn, Zongyan Wang: Octopus: Combining Learning and Parallel Search. J. Autom. Reasoning 33(2): 171-218 (2004) [9] M. Newborn: Theo and Octopus at the 2006 World Championship for Automated Reasoning Programs. https://www.cs.mcgill.ca/~newborn/TheoOctopusAtCasc06.pdf [10] Urban, J. MaLARea: a Metasystem for Automated Reasoning in Large Theories in CADE-21 Work- shop on Empirically Successful Automated Reasoning in Large Theories 257 (CEUR-WS.org, 2007). [11] Urban, J., Sutcliffe, G., Pudlák, P. & Vyskocil, J. MaLARea SG1- Machine Learner for Automated Reasoning with Semantic Guidance in IJCAR 2008 5195 (Springer, 2008), 441–456. [12] Kaliszyk, C., Urban, J. & Vyskocil, J. Machine Learner for Automated Reasoning 0.4 and 0.5 in PAAR@IJCAR 2014 31 (EasyChair, 2014), 60–66. [13] Piotrowski, B. & Urban, J. ATPBoost: Learning Premise Selection in Binary Setting with ATP Feedback in IJCAR 2018 10900 (Springer, 2018), 566–574. [14] Gauthier, T., Kaliszyk, C. & Urban, J. TacticToe: Learning to Reason with HOL4 Tactics in LPAR 21 46 (EasyChair, 2017), 125–143. [15] Gauthier, T., Kaliszyk, C. & Urban, J. Initial Experiments with Statistical Conjecturing over Large Formal Corpora in (CICM 2016 - Work in Progress Proceedings 1785 (CEUR-WS.org, 2016), 219– 228. [16] Thibault Gauthier. Deep reinforcement learning in HOL4. CoRR, abs/1910.11797, 2019. [17] Karel Chvalovsky, Thibault Gauthier & Josef Urban: First Experiments with Data Driven Conjecturing, AITP'19, 2019. http://aitp-conference.org/2019/abstract/AITP_2019_paper_27.pdf [18] Josef Urban, Jan Jakubuv: First Neural Conjecturing Datasets and Experiments. CoRR abs/2005.14664 (2020) [19] Lenat, D.B.: AM: an artificial intelligence approach to discovery in mathematics as heuristic search. Ph.D thesis, Stanford (1976) [20] Fajtlowicz, S.: On conjectures of Graffiti. Ann. Discrete Math. 72(1–3), 113–118 (1988) [21] Colton, S.: Automated Theory Formation in Pure Mathematics. Distinguished Dissertations. Springer, London (2012). [22] Johansson, M., Rosén, D., Smallbone, N., Claessen, K.: Hipster: integrating theory exploration in a proof assistant. In: Watt, S.M., Davenport, J.H., Sexton, A.P., Sojka, P., Urban, J. (eds.) CICM 2014. LNCS (LNAI), vol. 8543, pp. 108–122. Springer, Cham (2014).

Correctness: See above.

Clarity: See above.

Relation to Prior Work: See above.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper proposes an approach to augment training data of prover. Trained with the generated data, the the number of theorems proved increased from 557 to 600 as in Table 3, which is the state of the art number on this dataset.

Strengths: The strength of the paper has shown that data augmentation is useful in the field of automated theorem proving. As far as I understood, no researches in the field of ML based automated theorem proving has addressed this direction. As a result, the trained prover achieves the state of the art result on a MetaMath dataset.

Weaknesses: The weakness of the paper is the limited experimental evaluations. The evaluation data is set.mm only and the improvement in Table 3 is not large; I'm not convinced that we are just lucky on this dataset or not. The difference of the number of "test proofs found" between the authors Holophrasm and the previous Holophrasm, the number varies a lot depending configurations. Achieving state of the art is not impactful unless controlling computational cost etc.

Correctness: One approach to show the effectiveness of the proposed approach is to experiment on other Metamath theorem sets. The author proposes a constraint as in line 178. However the effectiveness is not evaluated; adding those analysis can better understanding of proposed data generation method.

Clarity: Combined with the supplementary material, the paper is overall well written. However, the detail of the substitution network is not clear. (Line 229 and 230.) Some example would be helpful.

Relation to Prior Work: The author can explicitly claim the difference or the contribution. However, the difference from the work of Whalen (2016) could have been stated more carefully.

Reproducibility: No

Additional Feedback: