{"title": "DeepMath - Deep Sequence Models for Premise Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 2235, "page_last": 2243, "abstract": "We study the effectiveness of neural sequence models for premise selection in automated theorem proving, a key bottleneck for progress in formalized mathematics. We propose a two stage approach for this task that yields good results for the premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing state-of-the-art models. To our knowledge, this is the first time deep learning has been applied theorem proving on a large scale.", "full_text": "DeepMath - Deep Sequence Models for Premise\n\nSelection\n\nAlexander A. Alemi \u2217\n\nGoogle Inc.\n\nalemi@google.com\n\nFran\u00e7ois Chollet \u2217\n\nGoogle Inc.\n\nfchollet@google.com\n\nNiklas Een \u2217\nGoogle Inc.\n\neen@google.com\n\nGeoffrey Irving \u2217\n\nGoogle Inc.\n\nChristian Szegedy \u2217\n\nGoogle Inc.\n\ngeoffreyi@google.com\n\nszegedy@google.com\n\nJosef Urban \u2217\u2020\n\nCzech Technical University in Prague\n\njosef.urban@gmail.com\n\nAbstract\n\nWe study the effectiveness of neural sequence models for premise selection in\nautomated theorem proving, one of the main bottlenecks in the formalization of\nmathematics. We propose a two stage approach for this task that yields good\nresults for the premise selection task on the Mizar corpus while avoiding the hand-\nengineered features of existing state-of-the-art models. To our knowledge, this is\nthe \ufb01rst time deep learning has been applied to theorem proving on a large scale.\n\n1\n\nIntroduction\n\nMathematics underpins all scienti\ufb01c disciplines. Machine learning itself rests on measure and\nprobability theory, calculus, linear algebra, functional analysis, and information theory. 
Complex\nmathematics underlies computer chips, transit systems, communication systems, and \ufb01nancial infras-\ntructure \u2013 thus the correctness of many of these systems can be reduced to mathematical proofs.\nUnfortunately, these correctness proofs are often impractical to produce without automation, and\npresent-day computers have only limited ability to assist humans in developing mathematical proofs\nand formally verifying human proofs. There are two main bottlenecks: (1) lack of automated methods\nfor semantic or formal parsing of informal mathematical texts (autoformalization), and (2) lack of\nstrong automated reasoning methods to \ufb01ll in the gaps in already formalized human-written proofs.\nThe two bottlenecks are related. Strong automated reasoning can act as a semantic \ufb01lter for autoformal-\nization, and successful autoformalization would provide a large corpus of computer-understandable\nfacts, proofs, and theory developments. Such a corpus would serve as both background knowledge to\n\ufb01ll in gaps in human-level proofs and as a training set to guide automated reasoning. Such guidance\nis crucial: exhaustive deductive reasoning tools such as today\u2019s resolution/superposition automated\ntheorem provers (ATPs) quickly hit combinatorial explosion, and are unusable when reasoning with a\nvery large number of facts without careful selection [4].\nIn this work, we focus on the latter bottleneck. We develop deep neural networks that learn from a\nlarge repository of manually formalized computer-understandable proofs. We learn the task that is\nessential for making today\u2019s ATPs usable over large formal corpora: the selection of a limited number\nof most relevant facts for proving a new conjecture. This is known as premise selection.\nThe main contributions of this work are:\n\n\u2217Authors listed alphabetically. All contributions are considered equal.\n\u2020Supported by ERC Consolidator grant nr. 
649043 AI4REASON.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f\u2022 A demonstration for the \ufb01rst time that neural network models are useful for aiding in large-scale automated logical reasoning without the need for hand-engineered features.\n\u2022 The comparison of various network architectures (including convolutional, recurrent and hybrid models) and their effect on premise selection performance.\n\u2022 A method of semantic-aware \u201cde\ufb01nition\u201d-embeddings for function symbols that improves the generalization of formulas with symbols occurring infrequently. This model outperforms previous approaches.\n\u2022 Analysis showing that neural network based premise selection methods are complementary to those with hand-engineered features: ensembling with previous results produces superior results.\n\n2 Formalization and Theorem Proving\n\nIn the last two decades, large corpora of complex mathematical knowledge have been formalized: encoded in complete detail so that computers can fully understand the semantics of complicated mathematical objects. The process of writing such formal and veri\ufb01able theorems, de\ufb01nitions, proofs, and theories is called Interactive Theorem Proving (ITP).\nThe ITP \ufb01eld dates back to the 1960s [16] and the Automath system by N.G. de Bruijn [9]. ITP systems include HOL (Light) [15], Isabelle [37], Mizar [13], Coq [7], and ACL2 [23]. The development of ITP has been intertwined with the development of its cousin \ufb01eld of Automated Theorem Proving (ATP) [31], where proofs of conjectures are attempted fully automatically. Unlike ATP systems, ITP systems allow human-assisted formalization and proving of theorems that are often beyond the capabilities of fully automated systems.\nLarge ITP libraries include the Mizar Mathematical Library (MML) with over 50,000 lemmas, and the core Isabelle, HOL, Coq, and ACL2 libraries with thousands of lemmas. 
These core libraries are a basis for large projects in formalized mathematics and in software and hardware veri\ufb01cation. Examples in mathematics include the HOL Light proof of the Kepler conjecture (Flyspeck project) [14], the Coq proofs of the Feit-Thompson theorem [12] and the Four Color theorem [11], and the veri\ufb01cation of most of the Compendium of Continuous Lattices in Mizar [2]. ITP veri\ufb01cation of the seL4 kernel [25] and the CompCert compiler [27] shows comparable progress in large-scale software veri\ufb01cation. While these large projects mark a coming of age of formalization, ITP remains labor-intensive. For example, Flyspeck took about 20 person-years, and Feit-Thompson about twice as much. Behind this cost are our two bottlenecks: the lack of tools for autoformalization and for strong proof automation.\nRecently the \ufb01eld of Automated Reasoning in Large Theories (ARLT) [35] has developed, including AI/ATP/ITP (AITP) systems called hammers that assist ITP formalization [4]. Hammers analyze the full set of theorems and proofs in the ITP libraries, estimate the relevance of each theorem, and apply optimized translations from the ITP logic to simpler ATP formalisms. Then they attack new conjectures using the most promising combinations of existing theorems and ATP search strategies. Recent evaluations have proved 40% of all Mizar and Flyspeck theorems fully automatically [20, 21]. However, there is signi\ufb01cant room for improvement: with perfect premise selection (a perfect choice of library facts) ATPs can prove at least 56% of Mizar and Flyspeck instead of today\u2019s 40% [4]. In the next section we explain the premise selection task and the experimental setting for measuring such improvements.\n\n3 Premise Selection, Experimental Setting and Previous Results\n\nGiven a formal corpus of facts and proofs expressed in an ATP-compatible format, our task is the following.\n\nDe\ufb01nition (Premise selection problem). 
Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A.\n\nWe use the Mizar Mathematical Library (MML) version 4.181.1147 as the formal corpus and E prover [32] version 1.9 as the underlying ATP system. The following list exempli\ufb01es a small non-representative sample of topics and theorems included in the Mizar Mathematical Library: Cauchy-Riemann Differential Equations of Complex Functions, Characterization and Existence of Gr\u00f6bner Bases, Maximum Network Flow Algorithm by Ford and Fulkerson, G\u00f6del\u2019s Completeness Theorem, Brouwer Fixed Point Theorem, Arrow\u2019s Impossibility Theorem, Borsuk-Ulam Theorem, Dickson\u2019s Lemma, Sylow Theorems, Hahn-Banach Theorem, The Law of Quadratic Reciprocity, Pepin\u2019s Primality Test for Public-Key Cryptography, and Ramsey\u2019s Theorem.\n\n3 ftp://mizar.uwb.edu.pl/pub/system/i386-linux/mizar-7.13.01_4.181.1147-i386-linux.tar\n\n\f:: t99_jordan: Jordan curve theorem in Mizar\nfor C being Simple_closed_curve holds C is Jordan;\n\n:: Translation to first order logic\nfof(t99_jordan, axiom,\n  (! [A] : ((v1_topreal2(A) & m1_subset_1(A, k1_zfmisc_1(u1_struct_0(k15_euclid(2))))) => v1_jordan1(A)))).\n\nFigure 1: (top) The \ufb01nal statement of the Mizar formalization of the Jordan curve theorem. (bottom) The translation to \ufb01rst-order logic, using name mangling to ensure uniqueness across the entire corpus.\n\n(a) Length in chars. (b) Length in words. (c) Word occurrences. (d) Dependencies.\n\nFigure 2: Histograms of statement lengths, occurrences of each word, and statement dependencies in the Mizar corpus translated to \ufb01rst-order logic. The wide length distribution poses dif\ufb01culties for RNN models and batching, and the many rarely occurring words make it important to take de\ufb01nitions of words into account.\n\nThis version of MML was used for the latest AITP evaluation reported in [21]. There are 57,917 proved Mizar theorems and unnamed top-level lemmas in this MML, organized into 1,147 articles. This set is chronologically ordered by the order of articles in MML and by the order of theorems in the articles. Proofs of later theorems can only refer to earlier theorems. This ordering also applies to 88,783 other Mizar formulas (encoding the type system and other automation known to Mizar) used in the problems. The formulas have been translated into \ufb01rst-order logic formulas by the MPTP system [34] (see Figure 1).\nOur goal is to automatically prove as many theorems as possible, using at each step all previous theorems and proofs. We can learn from both human proofs and ATP proofs, but previous experiments [26, 20] show that learning only from the ATP proofs is preferable to including human proofs if the set of ATP proofs is suf\ufb01ciently large. 
Since for 32,524 (56.2%) of the 57,917 theorems an ATP\nproof was previously found by a combination of manual and learning-based premise selection [21],\nwe use only these ATP proofs for training.\nThe 40% success rate from [21] used a portfolio of 14 AITP methods using different learners, ATPs,\nand numbers of premises. The best single method proved 27.3% of the theorems. Only fast and\nsimple learners such as k-nearest-neighbors, naive Bayes, and their ensembles were used, based on\nhand-crafted features such as the set of (normalized) sub-terms and symbols in each formula.\n\n4 Motivation for the use of Deep Learning\n\nStrong premise selection requires models capable of reasoning over mathematical statements, here\nencoded as variable-length strings of \ufb01rst-order logic. In natural language processing, deep neural net-\nworks have proven useful in language modeling [28], text classi\ufb01cation [8], sentence pair scoring [3],\nconversation modeling [36], and question answering [33]. These results have demonstrated the ability\nof deep networks to extract useful representations from sequential inputs without hand-tuned feature\nengineering. Neural networks can also mimic some higher-level reasoning on simple algorithmic\ntasks [38, 18].\n\n3\n\n\fFigure 3: (left) Our network structure. The input sequences are either character-level (section 5.1) or word-level\n(section 5.2). We use separate models to embed conjecture and axiom, and a logistic layer to predict whether the\naxiom is useful for proving the conjecture. (right) A convolutional model.\nThe Mizar data set is also an interesting case study in neural network sequence tasks, as it differs\nfrom natural language problems in several ways. It is highly structured with a simple context free\ngrammar \u2013 the interesting task occurs only after parsing. The distribution of lengths is wide, ranging\nfrom 5 to 84,299 characters with mean 304.5, and from 2 to 21,251 tokens with mean 107.4 (see\nFigure 2). 
Fully recurrent models would have to back-propagate through hundreds to thousands of characters or hundreds of tokens to embed a whole statement. Finally, there are many rare words \u2013 60.3% of the words occur fewer than 10 times \u2013 motivating the de\ufb01nition-aware embeddings in section 5.2.\n\n5 Overview of our approach\n\nThe full premise selection task takes a conjecture and a set of axioms and chooses a subset of axioms to pass to the ATP. We simplify from subset selection to pairwise relevance by predicting the probability that a given axiom is useful for proving a given conjecture. This approach depends on a relatively sparse dependency graph. Our general architecture is shown in Figure 3(left): the conjecture and axiom sequences are separately embedded into \ufb01xed-length real vectors, then concatenated and passed to a third network with two fully connected layers and logistic loss. During training, the two embedding networks and the joint predictor path are trained jointly.\nAs discussed in section 3, we train our models on premise selection data generated by a combination of various methods, including k-nearest-neighbor search on hand-engineered similarity metrics. We start with a \ufb01rst stage of character-level models, and then build second and later stages of word-level models on top of the results of earlier stages.\n\n5.1 Stage 1: Character-level models\n\nWe begin by avoiding special-purpose engineering: formulas are treated at the character level using an 80-dimensional one-hot encoding of the character sequence. These sequences are passed to a weight-shared network that accepts variable-length input. For the embedding computation, we have explored the following architectures:\n\n1. Pure recurrent LSTM [17] and GRU [6] networks.\n2. A pure multi-layer convolutional network with various numbers of convolutional layers (with strides) followed by a global temporal max-pooling reduction (see Figure 3(right)).\n3. 
A recurrent-convolutional network that uses convolutional layers to produce a shorter sequence which is then processed by an LSTM.\n\nThe exact architectures used are speci\ufb01ed in the experimental section.\nIt is computationally prohibitive to evaluate a large number of (conjecture, axiom) pairs due to the costly embedding phase. Fortunately, our architecture allows caching the embeddings for conjectures and axioms, so that only the \ufb01nal classi\ufb01er portion of the network needs to be evaluated for a given pair. This makes it practical to consider all pairs during evaluation.\n\n5.2 Stage 2: Word-level models\n\nThe character-level models are limited to word and structure similarity within the axiom or conjecture being embedded. However, many of the symbols occurring in a formula are de\ufb01ned by formulas earlier in the corpus, and we can use the axiom-embeddings of those symbols to improve model performance.\nSince Mizar is based on \ufb01rst-order set theory, de\ufb01nitions of symbols can be either explicit or implicit. An explicit de\ufb01nition of x sets x = e for some expression e, while an implicit de\ufb01nition states a property of the de\ufb01ned object, such as de\ufb01ning a function f(x) by \u2200x.f(f(x)) = g(x). To avoid manually encoding the structure of implicit de\ufb01nitions, we embed the entire statement de\ufb01ning a symbol f, and then use the stage 1 axiom-embedding corresponding to the whole statement as a word-level embedding.\nIdeally, we would train a single network that embeds statements by recursively expanding and embedding the de\ufb01nitions of the de\ufb01ned symbols. 
Unfortunately, this recursion would dramatically increase the cost of training since the de\ufb01nition chains can be quite deep. For example, Mizar de\ufb01nes real numbers in terms of non-negative reals, which are de\ufb01ned as Dedekind cuts of non-negative rationals, which are de\ufb01ned as ratios of naturals, etc. As an inexpensive alternative, we reuse the axiom embeddings computed by a previously trained character-level model, mapping each de\ufb01ned symbol to the axiom embedding of its de\ufb01ning statement. Other tokens such as brackets and operators are mapped to \ufb01xed pseudo-random vectors of the same dimension.\nSince we embed one token at a time, ignoring the grammatical structure, our approach does not require a parser: a trivial lexer is implemented in a few lines of Python. With word-level embeddings, we use the same architectures with shorter input sequences to produce axiom and conjecture embeddings for ranking the (conjecture, axiom) pairs. Iterating this approach, reusing the resulting stronger axiom embeddings as word embeddings for additional stages, did not yield measurable gains.\n\n6 Experiments\n\n6.1 Experimental Setup\n\nFor training and evaluation we use a subset of 32,524 out of 57,917 theorems that are known to be provable by an ATP given the right set of premises. We split off a random 10% of these (3,124 statements) for testing and validation. Also, we held out 400 statements from the 3,124 for monitoring training progress, as well as for model and checkpoint selection. Final evaluation was done on the remaining 2,724 conjectures. Note that we only held out conjectures, but we trained on all statements as axioms. This is comparable to our k-NN baseline, which is also trained on all statements as axioms. The randomized selection of the training and testing sets may also lead to learning from future proofs: a proof Pj of theorem Tj written after theorem Ti may guide the premise selection for Ti. 
However, previous k-NN experiments show similar performance between a full 10-fold cross-validation and incremental evaluation as long as chronologically preceding formulas participate in proofs of only later theorems.\n\n6.2 Metrics\n\nFor each conjecture, our models output a ranking of possible premises. Our primary metric is the number of conjectures proved from the top-k premises, where k = 16, 32, . . . , 1024. This metric can accommodate alternative proofs but is computationally expensive. Therefore we additionally measure the ranking quality using the average maximum relative rank of the testing premise set. Formally, average max relative rank is\n\naMRR = mean_C max_{P \u2208 Ptest(C)} rank(P, Pavail(C)) / |Pavail(C)|,\n\nwhere C ranges over conjectures, Pavail(C) is the set of premises available to prove C, Ptest(C) is the set of premises for conjecture C from the test set, and rank(P, Pavail(C)) is the rank of premise P among the set Pavail(C) according to the model. The motivation for aMRR is that conjectures are easier to prove if all their dependencies occur early in the ranking.\nSince it is too expensive to rank all axioms for a conjecture during continuous evaluation, we approximate our objective. For our holdout set of 400 conjectures, we select all true dependencies Ptest(C) and 128 \ufb01xed random false dependencies from Pavail(C) \u2212 Ptest(C) and compute the average max relative rank in this ordering. Note that aMRR is nonzero even if all true dependencies are ordered before false dependencies; the best possible value is 0.051.\n\n\fFigure 4: Speci\ufb01cation of the different embedder networks.\n\n6.3 Network Architectures\n\nAll our neural network models use the general architecture from Fig 3: a classi\ufb01er on top of the concatenated embeddings of an axiom and a conjecture. The same classi\ufb01er architecture was used for all models: a fully-connected neural network with one hidden layer of size 1024. 
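This two-tower setup (separate embedders feeding a small classifier over concatenated embeddings) can be sketched in a few lines. The snippet below is a minimal NumPy illustration with made-up dimensions and a stand-in bag-of-tokens embedder; it is not the paper's actual TensorFlow/Keras code, only an untrained sketch of the scoring path and the embedding cache:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 256   # illustrative; the paper does not fix this value here
HIDDEN = 1024     # classifier hidden layer size from Section 6.3

# Stand-in embedder: mean of token vectors (the real models use CNNs/RNNs).
token_table = rng.normal(size=(100, EMBED_DIM))

def embed(token_ids):
    return token_table[token_ids].mean(axis=0)

# Classifier head: concat(conjecture, axiom) -> 1024 ReLU units -> logistic.
W1 = rng.normal(scale=0.01, size=(2 * EMBED_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.01, size=(HIDDEN,))
b2 = 0.0

def score(conj_emb, axiom_emb):
    h = np.maximum(0.0, np.concatenate([conj_emb, axiom_emb]) @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

# Embeddings are computed once per statement and cached, so ranking all
# axioms for a conjecture only re-runs the cheap classifier head.
cache = {name: embed(ids) for name, ids in
         {"conj": [1, 5, 7], "ax1": [2, 3], "ax2": [4, 9]}.items()}
ranking = sorted(["ax1", "ax2"], key=lambda a: -score(cache["conj"], cache[a]))
```

The caching step is the point: the expensive sequence embedders run once per statement, and only the two fully connected layers run per (conjecture, axiom) pair.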
For each model, the axiom and conjecture embedding networks have the same architecture without sharing weights. The details of the embedding networks are shown in Fig 4.\n\n6.4 Network Training\n\nThe neural networks were trained using asynchronous distributed stochastic gradient descent with the Adam optimizer [24] and up to 20 parallel NVIDIA K-80 GPU workers per model. We used the TensorFlow framework [1] and the Keras library [5]. The weights were initialized using [10]. Polyak averaging with 0.9999 decay was used for producing the evaluation weights [30]. The character-level models were trained with a maximum sequence length of 2048 characters, while the word-level (and de\ufb01nition-embedding) based models had a maximum sequence length of 500 words. For good performance, especially for low cutoff thresholds, it was critical to employ negative mining during training. A side process continuously evaluated many (conjecture, axiom) pairs. For each conjecture, we pick the lowest-scoring statements that score higher than the lowest-scoring true positive. A queue of previously mined negatives is maintained to produce a mixture of examples in which the ratio of mined instances is about 25% and the rest are randomly selected premises. Negative mining was crucial for good quality: at the top-16 cutoff, the number of proved theorems on the test set doubled. For the union of proof attempts over all cutoff thresholds, the ratio of successful proofs increased from 61.3% to 66.4% for the best neural model.\n\n6.5 Experimental Results\n\nOur best selection pipeline uses a stage-1 character-level convolutional neural network model to produce word-level embeddings for the second stage. The baseline uses distance-weighted k-NN [19, 21] with handcrafted semantic features [22]. 
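The hard-negative mining scheme described in Section 6.4 can be sketched as follows. This is a simplified, single-process version with hypothetical helper names and a toy scoring function; the real system ran the miner as a separate asynchronous process feeding the trainer:

```python
import random
from collections import deque

def mine_hard_negatives(conjecture, candidates, true_positives, score):
    """Return non-dependencies that score above the weakest true positive."""
    threshold = min(score(conjecture, p) for p in true_positives)
    return [c for c in candidates
            if c not in true_positives and score(conjecture, c) > threshold]

def training_batch(conjecture, premises, true_positives, score,
                   mined_queue, n_negatives=32, mined_ratio=0.25):
    """Mix ~25% mined hard negatives with randomly selected negatives."""
    mined_queue.extend(
        mine_hard_negatives(conjecture, premises, true_positives, score))
    n_mined = min(int(n_negatives * mined_ratio), len(mined_queue))
    negatives = [mined_queue.popleft() for _ in range(n_mined)]
    pool = [p for p in premises if p not in true_positives]
    while len(negatives) < n_negatives:
        negatives.append(random.choice(pool))
    # Label 1 for true dependencies, 0 for negatives.
    return ([(conjecture, p, 1) for p in true_positives] +
            [(conjecture, n, 0) for n in negatives])

# Toy usage with a hypothetical (deterministic-per-run) similarity score.
score = lambda c, p: abs(hash((c, p))) % 100 / 100.0
queue = deque()
batch = training_batch("conj", [f"ax{i}" for i in range(50)],
                       {"ax1", "ax2"}, score, queue)
```

The queue lets freshly mined negatives from the side process be consumed gradually, keeping the mined fraction of each batch roughly constant.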
For all conjectures in our holdout set, we consider all the chronologically preceding statements (lemmas, de\ufb01nitions and axioms) as premise candidates. In the DeepMath case, premises were ordered by their logistic scores. E prover was applied to the top-k of the premise candidates for each of the cutoffs k \u2208 (16, 32, . . . , 1024) until a proof is found or k = 1024 fails. Table 1 reports the number of theorems proved with a cutoff value at most the k in the leftmost column. For E prover, we used the auto strategy with a soft time limit of 90 seconds, a hard time limit of 120 seconds, a memory limit of 4 GB, and a processed clauses limit of 500,000.\n\n\f(a) Training accuracy for different character-level models without hard negative mining. Recurrent models seem to underperform, while pure convolutional models yield the best results. For each architecture, we trained three models with different random initialization seeds. Only the best runs are shown on this graph; we did not see much variance between runs on the same architecture.\n\n(b) Test average max relative rank for different models without hard negative mining. The best is a word-level CNN using de\ufb01nition embeddings from a character-level 2-layer CNN. An identical word-embedding model with random starting embeddings over\ufb01ts after only 250,000 iterations and underperforms the best character-level model.\n\nOur most successful models employ simple convolutional networks followed by max pooling (as opposed to recurrent networks like LSTM/GRU), and the two-stage de\ufb01nition-based def-CNN outperforms the na\u00efve word-CNN word embedding signi\ufb01cantly. In the latter the word embeddings were learned in a single pass; in the former they are \ufb01xed from the stage-1 character-level model. For each architecture (cf. Figure 4) two convolutional layers perform best. Although our models differ
Although our models differ\nsigni\ufb01cantly from each other, they differ even more from the k-NN baseline based on hand-crafted\nfeatures. The right column of Table 1 shows the result if we average the prediction score of the stage-1\nmodel with that of the de\ufb01nition based stage-2 model. We also experimented with character-based\nRNN models using shorter sequences: these lagged behind our long-sequence CNN models but\nperformed signi\ufb01cantly better than those RNNs trained on longer sequences. This suggest that RNNs\ncould be improved by more sophisticated optimization techniques such as curriculum learning.\n\nCutoff\n16\n32\n64\n128\n256\n512\n1024\n\nk-NN Baseline (%)\n\n674 (24.6)\n1081 (39.4)\n1399 (51)\n1612 (58.8)\n1709 (62.3)\n1762 (64.3)\n1786 (65.1)\n\nchar-CNN (%) word-CNN (%)\n\n687 (25.1)\n1028 (37.5)\n1295 (47.2)\n1534 (55.9)\n1656 (60.4)\n1711 (62.4)\n1762 (64.3)\n\n709 (25.9)\n1063 (38.8)\n1355 (49.4)\n1552 (56.6)\n1635 (59.6)\n1712 (62.4)\n1755 (64)\n\ndef-CNN-LSTM (%)\n\n644 (23.5)\n924 (33.7)\n1196 (43.6)\n1401 (51.1)\n1519 (55.4)\n1593 (58.1)\n1647 (60.1)\n\ndef-CNN (%)\n\n734 (26.8)\n1093 (39.9)\n1381 (50.4)\n1617 (59)\n1708 (62.3)\n1780 (64.9)\n1822 (66.4)\n\ndef+char-CNN (%)\n\n835 (30.5)\n1218 (44.4)\n1470 (53.6)\n1695 (61.8)\n1780 (64.9)\n1830 (66.7)\n1862 (67.9)\n\nTable 1: Results of ATP premise selection experiments with hard negative mining on a test set of 2,742 theorems.\nEach entry is the number (%) of theorems proved by E prover using that particular model to rank the premises.\nThe union of def-CNN and char-CNN proves 69.8% of the test set, while the union of the def-CNN and k-NN\nproves 74.25%. This means that the neural network predictions are more complementary to the k-NN predictions\nthan to other neural models. 
The union of all methods proves 2218 theorems (80.9%) and the neural models alone prove 2151 (78.4%).\n\n\fModel          Test min average relative rank\nchar-CNN       0.0585\nword-CNN       0.06\ndef-CNN-LSTM   0.0605\ndef-CNN        0.0575\n\n(a) Jaccard similarities between proved sets of conjectures across models. The predictions of the neural network models are more similar to each other than to those of the k-NN baseline.\n\n(b) Best sustained test results obtained by the above models. Lower values are better. This was monitored continuously during training on a holdout set with 400 theorems, using all true positive premises and 128 randomly selected negatives. In this setup, the lowest attainable average max relative rank with perfect predictions is 0.051.\n\n7 Conclusions\n\nIn this work we provide evidence that even simple neural models can compete with hand-engineered features for premise selection, helping to \ufb01nd many new proofs. This translates to real gains in automatic theorem proving. Despite these encouraging results, our models are relatively shallow networks with inherent limitations to representational power, and they are incapable of capturing high-level properties of mathematical statements. We believe theorem proving is a challenging and important domain for deep learning methods, and that more sophisticated optimization techniques and training methodologies will prove more useful than in less structured domains.\n\n8 Acknowledgments\n\nWe would like to thank Cezary Kaliszyk for providing us with an improved baseline model. Many thanks also go to the Google Brain team for their generous help with the training infrastructure. We would like to thank Quoc Le for useful discussions on the topic and Sergio Guadarrama for his help with TensorFlow-slim.\n\nReferences\n\n[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. 
Jozefowicz, L. Kaiser,\nM. Kudlur, J. Levenberg, D. Man\u00e9, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,\nB. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi\u00e9gas, O. Vinyals, P. War-\nden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on\nheterogeneous systems, 2015. Software available from tensor\ufb02ow.org.\n\n[2] G. Bancerek and P. Rudnicki. A Compendium of Continuous Lattices in MIZAR. J. Autom. Reasoning,\n\n29(3-4):189\u2013224, 2002.\n\n[3] P. Baudi\u0161, J. Pichl, T. Vysko\u02c7cil, and J. \u0160ediv\u00fd. Sentence pair scoring: Towards uni\ufb01ed framework for text\n\ncomprehension. arXiv preprint arXiv:1603.06127, 2016.\n\n[4] J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban. Hammering towards QED. J. Formalized\n\nReasoning, 9(1):101\u2013148, 2016.\n\n[5] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.\n[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. arXiv preprint\n\narXiv:1502.02367, 2015.\n\n[7] The Coq Proof Assistant. http://coq.inria.fr.\n[8] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing\n\nSystems, pages 3061\u20133069, 2015.\n\n[9] N. de Bruijn. The mathematical language AUTOMATH, its usage, and some of its extensions. In M. Laudet,\neditor, Proceedings of the Symposium on Automatic Demonstration, pages 29\u201361, Versailles, France, Dec.\n1968. Springer-Verlag LNM 125.\n\n[10] X. Glorot and Y. Bengio. Understanding the dif\ufb01culty of training deep feedforward neural networks. In\n\nInternational conference on arti\ufb01cial intelligence and statistics, pages 249\u2013256, 2010.\n\n[11] G. Gonthier. The four colour theorem: Engineering of a formal proof. In D. Kapur, editor, Computer\nMathematics, 8th Asian Symposium, ASCM 2007, Singapore, December 15-17, 2007. 
Revised and Invited\nPapers, volume 5081 of Lecture Notes in Computer Science, page 333. Springer, 2007.\n\n8\n\n\f[12] G. Gonthier, A. Asperti, J. Avigad, Y. Bertot, C. Cohen, F. Garillot, S. L. Roux, A. Mahboubi, R. O\u2019Connor,\nS. O. Biha, I. Pasca, L. Rideau, A. Solovyev, E. Tassi, and L. Th\u00e9ry. A machine-checked proof of the Odd\nOrder Theorem. In S. Blazy, C. Paulin-Mohring, and D. Pichardie, editors, ITP, volume 7998 of LNCS,\npages 163\u2013179. Springer, 2013.\n\n[13] A. Grabowski, A. Korni\u0142owicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning,\n\n3(2):153\u2013245, 2010.\n\n[14] T. C. Hales, M. Adams, G. Bauer, D. T. Dang, J. Harrison, T. L. Hoang, C. Kaliszyk, V. Magron,\nS. McLaughlin, T. T. Nguyen, T. Q. Nguyen, T. Nipkow, S. Obua, J. Pleso, J. Rute, A. Solovyev, A. H. T.\nTa, T. N. Tran, D. T. Trieu, J. Urban, K. K. Vu, and R. Zumkeller. A formal proof of the Kepler conjecture.\nCoRR, abs/1501.02155, 2015.\n\n[15] J. Harrison. HOL Light: A tutorial introduction. In M. K. Srivas and A. J. Camilleri, editors, FMCAD,\n\nvolume 1166 of LNCS, pages 265\u2013269. Springer, 1996.\n\n[16] J. Harrison, J. Urban, and F. Wiedijk. History of interactive theorem proving. In J. H. Siekmann, editor,\nComputational Logic, volume 9 of Handbook of the History of Logic, pages 135 \u2013 214. North-Holland,\n2014.\n\n[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n[18] \u0141. Kaiser and I. Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.\n[19] C. Kaliszyk and J. Urban. Stronger automation for Flyspeck by feature weighting and strategy evolution.\nIn J. C. Blanchette and J. Urban, editors, PxTP 2013, volume 14 of EPiC Series, pages 87\u201395. EasyChair,\n2013.\n\n[20] C. Kaliszyk and J. Urban. Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning,\n\n53(2):173\u2013213, 2014.\n\n[21] C. Kaliszyk and J. Urban. 
MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245\u2013256, 2015.\n[22] C. Kaliszyk, J. Urban, and J. Vyskocil. Ef\ufb01cient semantic features for automated reasoning over large\ntheories. In Q. Yang and M. Wooldridge, editors, Proceedings of the Twenty-Fourth International Joint\nConference on Arti\ufb01cial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages\n3084\u20133090. AAAI Press, 2015.\n\n[23] M. Kaufmann and J. S. Moore. An ACL2 tutorial. In Mohamed et al. [29], pages 17\u201321.\n[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[25] G. Klein, J. Andronick, K. Elphinstone, G. Heiser, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt,\nR. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood. seL4: formal veri\ufb01cation of an operating-\nsystem kernel. Commun. ACM, 53(6):107\u2013115, 2010.\n\n[26] D. Kuehlwein and J. Urban. Learning from multiple proofs: First experiments. In P. Fontaine, R. A.\nSchmidt, and S. Schulz, editors, PAAR-2012, volume 21 of EPiC Series, pages 82\u201394. EasyChair, 2013.\n\n[27] X. Leroy. Formal veri\ufb01cation of a realistic compiler. Commun. ACM, 52(7):107\u2013115, 2009.\n[28] T. Mikolov, M. Kara\ufb01\u00e1t, L. Burget, J. Cernock`y, and S. Khudanpur. Recurrent neural network based\n\nlanguage model. In INTERSPEECH, volume 2, page 3, 2010.\n\n[29] O. A. Mohamed, C. A. Mu\u00f1oz, and S. Tahar, editors. Theorem Proving in Higher Order Logics, 21st\nInternational Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume\n5170 of LNCS. Springer, 2008.\n\n[30] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on\n\nControl and Optimization, 30(4):838\u2013855, 1992.\n\n[31] J. A. Robinson and A. Voronkov, editors. Handbook of Automated Reasoning (in 2 volumes). Elsevier and\n\nMIT Press, 2001.\n\n[32] S. Schulz. E - A Brainiac Theorem Prover. 
AI Commun., 15(2-3):111\u2013126, 2002.\n[33] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information\n\nProcessing Systems, pages 2431\u20132439, 2015.\n\n[34] J. Urban. MPTP 0.2: Design, implementation, and initial experiments. JAR, 37(1-2):21\u201343, 2006.\n[35] J. Urban and J. Vysko\u02c7cil. Theorem proving in large formal mathematics as an emerging AI \ufb01eld. In M. P.\nBonacina and M. E. Stickel, editors, Automated Reasoning and Mathematics: Essays in Memory of William\nMcCune, volume 7788 of LNAI, pages 240\u2013257. Springer, 2013.\n\n[36] O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.\n[37] M. Wenzel, L. C. Paulson, and T. Nipkow. The Isabelle framework. In Mohamed et al. [29], pages 33\u201338.\n[38] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.\n\n9\n\n\f", "award": [], "sourceid": 1151, "authors": [{"given_name": "Geoffrey", "family_name": "Irving", "institution": "Google"}, {"given_name": "Christian", "family_name": "Szegedy", "institution": "Google"}, {"given_name": "Alexander", "family_name": "Alemi", "institution": "Google"}, {"given_name": "Niklas", "family_name": "Een", "institution": "Google Inc."}, {"given_name": "Francois", "family_name": "Chollet", "institution": "Google, Inc"}, {"given_name": "Josef", "family_name": "Urban", "institution": "Czech Technical University in Prague"}]}