{"title": "Premise Selection for Theorem Proving by Deep Graph Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 2786, "page_last": 2796, "abstract": "We propose a deep learning-based approach to the problem of premise selection: selecting mathematical statements relevant for proving a given conjecture. We represent a higher-order logic formula as a graph that is invariant to variable renaming but still fully preserves syntactic and semantic information. We then embed the graph into a vector via a novel embedding method that preserves the information of edge ordering. Our approach achieves state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.", "full_text": "Premise Selection for Theorem Proving\n\nby Deep Graph Embedding\n\nMingzhe Wang\u2217 Yihe Tang\u2217 Jian Wang\nUniversity of Michigan, Ann Arbor\n\nJia Deng\n\nAbstract\n\nWe propose a deep learning-based approach to the problem of premise selection:\nselecting mathematical statements relevant for proving a given conjecture. We\nrepresent a higher-order logic formula as a graph that is invariant to variable\nrenaming but still fully preserves syntactic and semantic information. We then\nembed the graph into a vector via a novel embedding method that preserves the\ninformation of edge ordering. Our approach achieves state-of-the-art results on the\nHolStep dataset, improving the classi\ufb01cation accuracy from 83% to 90.3%.\n\n1\n\nIntroduction\n\nAutomated reasoning over mathematical proofs is a core question of arti\ufb01cial intelligence that dates\nback to the early days of computer science [1]. 
It not only constitutes a key aspect of general intelligence, but also underpins a broad set of applications ranging from circuit design to compilers, where it is critical to verify the correctness of a computer system [2, 3, 4].

A key challenge of theorem proving is premise selection [5]: selecting relevant statements that are useful for proving a given conjecture. Theorem proving is essentially a search problem with the goal of finding a sequence of deductions leading from presumed facts to the given conjecture. The space of this search is combinatorial: with today's large mathematical knowledge bases [6, 7], the search can quickly explode beyond the capability of modern automated theorem provers, despite the fact that often only a small fraction of facts in the knowledge base are relevant for proving a given conjecture. Premise selection thus plays a critical role in narrowing down the search space and making it tractable.

Premise selection has mainly been tackled with hand-designed heuristics based on comparing and analyzing symbols [8]. Recently, machine learning methods have emerged as a promising alternative for premise selection, which can naturally be cast as a classification or ranking problem. Alama et al. [9] trained a kernel-based classifier using essentially bag-of-words features, and demonstrated a large improvement over the state-of-the-art system. Alemi et al. [5] were the first to apply deep learning approaches to premise selection and demonstrated competitive results without manual feature engineering. Kaliszyk et al. [10] introduced HolStep, a large dataset of higher-order logic proofs, and provided baselines based on logistic regression and deep networks.

In this paper we propose a new deep learning approach to premise selection. The key idea of our approach is to represent mathematical formulas as graphs and embed them into vector space.
This is different from prior work on premise selection that directly applies deep networks to sequences of characters or tokens [5, 10]. Our approach is motivated by the observation that a mathematical formula can be represented as a graph that encodes the syntactic and semantic structure of the formula. For example, the formula ∀x∃y(P(x) ∧ Q(x, y)) can be expressed as the graph shown in Fig. 1, where edges link terms to their constituents and connect quantifiers to their variables.

∗Equal contribution.

Figure 1: The formula ∀x∃y(P(x) ∧ Q(x, y)) can be represented as a graph.

Our hypothesis is that such graph representations are better than sequential forms because a graph makes explicit key syntactic and semantic structures such as composition, variable binding, and co-reference. Such an explicit representation helps the learning of invariant feature representations. For example, P(x, T(f(z) + g(z), v)) ∧ Q(y) and P(y) ∧ Q(x) share the same top-level structure P ∧ Q, but such similarity would be less apparent and harder to detect from a sequence of tokens because syntactically close terms can be far apart in the sequence.

Another benefit of a graph representation is that we can make it invariant to variable renaming while preserving the semantics. For example, the graph for ∀x∃y(P(x) ∧ Q(x, y)) (Fig.
1) is the\nsame regardless of how the variables are named in the formula, but the semantics of quanti\ufb01ers and\nco-reference is completely preserved\u2014the quanti\ufb01er \u2200 binds a variable that is the \ufb01rst argument of\nboth P and Q, and the quanti\ufb01er \u2203 binds a variable that is the second argument of Q.\nIt is worth noting that although a sequential form encodes the same information, and a neural network\nmay well be able to learn to convert a sequence of tokens into a graph, such a neural conversion\nis unnecessary\u2014unlike parsing natural language sentences, constructing a graph out of a formula\nis straightforward and unambiguous. Thus there is no obvious bene\ufb01t to be gained through an\nend-to-end approach that starts from the textual representation of formulas.\nTo perform premise selection, we convert a formula into a graph, embed the graph into a vector,\nand then classify the relevance of the formula. To embed a graph into a vector, we assign an initial\nembedding vector for each node of the graph, and then iteratively update the embedding of each\nnode using the embeddings of its neighbors. We then pool the embeddings of all nodes to form\nthe embedding of the entire graph. The parameters of each update are learned end to end through\nbackpropagation. In other words, we learn a deep network that embeds a graph into a vector; the\ntopology of the unrolled network is determined by the input graph.\nWe perform experiments using the HolStep dataset [10], which consists of over two million conjecture-\nstatement pairs that can be used to evaluate premise selection. The results show that our graph-\nembedding approach achieves large improvement over sequence-based models. In particular, our\napproach improves the state-of-the-art accuracy on HolStep by 7.3%.\nOur main contributions of this work are twofold. First, we propose a novel approach to premise\nselection that represents formulas as graphs and embeds them into vectors. 
To the best of our knowledge, this is the first time premise selection has been approached using deep graph embedding. Second, we improve the state-of-the-art classification accuracy on the HolStep dataset from 83% to 90.3%.

2 Related Work

Research on automated theorem proving has a long history [11]. Decades of research have resulted in a variety of well-developed automated theorem provers such as Vampire [12] and E [13]. However, no existing automated prover can scale to large mathematical libraries due to the combinatorial explosion of the search space. This limitation gave rise to the development of interactive theorem proving [11], e.g. Coq [14] and Isabelle [15], which combines humans and machines in theorem proving and has led to impressive achievements such as the proof of the Kepler conjecture [16] and the formal proof of the Feit-Thompson theorem [17].

Premise selection as a machine learning problem was introduced by Alama et al. [9], who constructed a corpus of proofs to train a kernelized classifier using bag-of-words features that represent the occurrences of terms in a vocabulary. Deep learning techniques were first applied to premise selection in the DeepMath work by Alemi et al. [5], who applied recurrent and convolutional networks to formulas represented as textual sequences, and showed that deep learning approaches can achieve competitive results against baselines using hand-engineered features. Serving the need for large datasets for training deep models, Kaliszyk et al. [10] introduced the HolStep dataset, which consists of 2M statements and 10K conjectures, an order of magnitude larger than the DeepMath dataset [5].

A task related to premise selection is internal guidance of ATPs [18, 19, 20, 21, 22, 23, 24]: the selection of the next clause to process inside an automated theorem prover.
Internal guidance differs\nfrom premise selection in that internal guidance depends on the logical representation, inference\nalgorithm, and current state inside a theorem prover, whereas premise selection is only about picking\nrelevant statements as the initial input to a theorem prover that is treated as a black box. Because\ninternal guidance is tightly integrated with proof search and is invoked repeatedly, ef\ufb01ciency is as\nimportant as accuracy, whereas for premise selection ef\ufb01ciency is not as critical.\nLoos et al. [25] were the \ufb01rst to apply deep networks to internal guidance of ATPs. They experimented\nwith both sequential representations and tree representations (recursive neural networks [26, 27]).\nNote that their tree representations are simply the parse trees, which, unlike our graphs, are not\ninvariant to variable renaming and do not capture how quanti\ufb01ers bind variables. Whalen [23] uses\nGRU networks to guide the exploration of partial proof trees, with formulas represented as sequences\nof tokens.\nIn addition to premise selection and internal guidance, other aspects of theorem proving have also\nbene\ufb01ted from machine learning. For example, K\u00fchlwein & Urban [28] applied kernel methods to\nstrategy \ufb01nding, the problem of searching for good parameter con\ufb01gurations for an automated prover.\nSimilarly, Bridge et al. [29] applied SVM and Gaussian Processes to select good heuristics, which\nare collections of standard settings for parameters and other decisions.\nOur graph embedding method is related to a large body of prior work on embeddings and graphs.\nDeepwalk [30], LINE [31] and Node2Vec [32] focus on learning node embeddings. Similar to\nWord2Vec [33, 34], they optimize the embedding of a node to predict nodes in a neighborhood.\nRecursive neural networks [35, 27] and Tree LSTMs [36] consider embeddings of trees, a special\ntype of graphs. 
Misra & Artzi [37] embed tree representations of typed lambda calculus expressions into vectors, with variable nodes labeled with only their types. This leads to invariance to variable renaming, but is not entirely lossless in terms of semantics. If a formula contains multiple variables of the same type but with different names, it is not possible to know which lambda abstraction binds which variable.

Neural networks on general graphs were first introduced by Gori et al. [38] and Scarselli et al. [39]. Many follow-up works [40, 41, 42, 43, 44, 45] proposed specific architectures to handle graph-based input, either extending recurrent neural networks to graph data [38, 41, 42] or making use of graph convolutions based on spectral graph theory [40, 43, 44, 45, 46]. Our approach is most similar to the work of [40], which encodes molecular fragments as neural fingerprints with graph-based convolutions for chemical applications. To the best of our knowledge, however, no previous deep learning approach on general graphs preserves the order of edges. In contrast, we propose a novel way of graph embedding that can preserve the information of edge ordering, and we demonstrate its effectiveness for premise selection.

3 FormulaNet: Formulas to Graphs to Embeddings

3.1 Formulas to Graphs

We consider formulas in higher-order logic [47]. A higher-order formula can be defined recursively based on a vocabulary of constants, variables, and quantifiers. A variable or a constant can act as a value or a function.
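To make this recursive definition concrete, here is a minimal sketch of such a term structure in Python. This is illustrative only; the type and field names are our own, not from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Union

# Illustrative term structure for higher-order formulas: a term is a leaf
# (constant or variable value), an application of a constant or variable
# function, or a quantified subterm.

@dataclass
class Leaf:
    name: str                # e.g. a constant value "c" or a variable "x"

@dataclass
class Apply:
    fn: str                  # a constant or variable function, e.g. "P", "f"
    args: List["Term"]

@dataclass
class Quant:
    q: str                   # quantifier symbol, e.g. "!" for ∀, "?" for ∃
    var: str                 # the bound variable's name
    body: "Term"

Term = Union[Leaf, Apply, Quant]

# ∀f ∃x (f(x, c) ∧ P(f)): note that f occurs both as a function
# (in f(x, c)) and as a value (as the argument of P).
formula = Quant("!", "f",
          Quant("?", "x",
          Apply("∧", [Apply("f", [Leaf("x"), Leaf("c")]),
                      Apply("P", [Leaf("f")])])))
```

The same name appearing in both a function position and a value position is what motivates the separate VAR and VARFUNC tokens introduced below.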
For example, \u2200f\u2203x(f (x, c) \u2227 P (f )) is a higher-order formula where \u2200 and \u2203 are\nquanti\ufb01ers, c is a constant value, P,\u2227 are constant functions, x is a variable value, and f is both a\nvariable function and a variable value.\nTo construct a graph from a formula, we \ufb01rst parse the formula into a tree, where each internal node\nrepresents a constant function, a variable function, or a quanti\ufb01er, and each leaf node represents a\nvariable value or a constant value. We then add edges that connect a quanti\ufb01er node to all instances of\nits quanti\ufb01ed variables, after which we merge (leaf) nodes that represent the same constant or variable.\nFinally, for each occurrence of a variable, we replace its original name with VAR, or VARFUNC if it\nacts as a function. Fig. 2 illustrates these steps.\n\n3\n\n\fFigure 2: From a formula to a graph: (a) the input formula; (b) parsing the formula into a tree; (c)\nmerging leaves and connecting quanti\ufb01ers to variables; (d) renaming variables.\nFormally, let S be the set of all formulas, Cv be the set of constant values, Cf the set of constant\nfunctions, Vv the set of variable values, Vf the set of variable functions, and Q the set of quanti\ufb01ers.\nLet s be a higher-order logic formula with no free variables\u2014any free variables can be bounded\nby adding quanti\ufb01ers \u2200 to the front of the formula. The graph Gs = (Vs, Es) of formula s can be\nrecursively constructed as follows:\n\n(cid:16)\n\ni Esi \u222a {(f, \u03bd(si))}i) followed by Gs \u2190 MERGE_C(G(cid:48)\n\ns \u2190 ((cid:83)n\ni Vsi \u222a {f},(cid:83)n\nVt \u222a {f}, Et \u222a {(\u03c6, \u03bd(t))(cid:83)\n\n\u2022 if s = \u03b1, where \u03b1 \u2208 Cv \u222a Vv, then Gs \u2190 ({\u03b1},\u2205), i.e. the graph contains a single node \u03b1.\n\u2022 if s = f (s1, s2, . . . , sn), where f \u2208 Cf \u222a Vf and s1, . . . 
, sn \u2208 S, then we perform\nG(cid:48)\ns), where\n\u03bd(si) is the \u201chead node\u201d of si and MERGE_C is an operation that merges the same constant\n(leaf) nodes in the graph.\ns \u2190\ns ) if x \u2208\nVv \u222a Vf and Gs \u2190 RENAMEx(G(cid:48)\ns), where Vt[x] is the nodes that represent the variable x in\nthe graph of t, MERGEx is an operation that merges all nodes representing the variable x into\na single node, and RENAMEx is an operation that renames x to VAR (or VARFUNC if x acts as\na function).\n\n\u2022 if s = \u03c6xt, where \u03c6 \u2208 Q, t \u2208 S, x \u2208 Vv \u222a Vf ,\n\nv\u2208Vt[x]{(\u03c6, v)}(cid:17)\n\nthen we perform G(cid:48)(cid:48)\ns \u2190 MERGEx(G(cid:48)(cid:48)\n\n, followed by G(cid:48)\n\nBy construction, our graph is invariant to variable renaming, yet no syntactic or semantic information\nis lost. This is because for a variable node (either as a function or value), its original name in the\nformula is irrelevant in the graph\u2014the graph structure already encodes where it is syntactically and\nwhich quanti\ufb01er binds it.\n\n3.2 Graphs to Embeddings\n\nTo embed a graph to a vector, we take an approach similar to performing convolution or message\npassing on graphs [40]. The overall idea is to associate each node with an initial embedding and\niteratively update them. As shown in Fig. 3, suppose v and each node around v has an initial\nembedding. We update the embedding of v by the node embeddings in its neighborhood. After\nmulti-step updates, the embedding of v will contain information from its local strcuture. Then we\nmax-pool the node embeddings across all of nodes in the graph to form an embedding for the graph.\nTo initialize the embedding for each node, we use the one-hot vector that represents the name of the\nnode. Note that in our graph all variables have the same name VAR (or VARFUNC if the variable acts\nas a function), so their initial embeddings are the same. 
All other nodes (constants and quantifiers) each have their own names and thus their own one-hot vectors.

We then repeatedly update the embedding of each node using the embeddings of its neighbors. Given a graph G = (V, E), at step t + 1 we update the embedding x_v^{t+1} of node v as follows:

    x_v^{t+1} = F_P^t( x_v^t + (1/d_v) [ Σ_{(u,v)∈E} F_I^t(x_u^t, x_v^t) + Σ_{(v,u)∈E} F_O^t(x_v^t, x_u^t) ] ),    (1)

where d_v is the degree of node v, F_I^t and F_O^t are update functions using incoming edges and outgoing edges respectively, and F_P^t is an update function that combines the old embedding with the new updates from neighbor nodes. We parametrize these update functions as neural networks; the detailed configurations will be given in Sec. 4.2.

Figure 3: An example of applying the order-preserving updates in Eqn. 2. To update node v, we consider its neighbors and its position in all treelets (see Sec. 3.3) it belongs to.

It is worth noting that all node embeddings are updated in parallel using the same update functions, but the update functions can be different across steps to allow more flexibility. Repeated updates allow each embedding to incorporate information from a bigger neighborhood and thus capture more global structures. Interestingly, with zero updates, our model reduces to a bag-of-words representation, that is, a max pooling of individual node embeddings.

To predict the usefulness of a statement for a conjecture, we send the concatenation of their embeddings to a classifier. The classification can also be done in the unconditional setting where only the statement is given; in this case we directly send the embedding of the statement to a classifier.
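In code, one step of this update might look like the following simplified sketch. Here F_I, F_O, and F_P stand in for the learned networks (they are arbitrary callables), and all names are illustrative rather than the authors' implementation.

```python
# Simplified, dependency-free sketch of one update step from Eqn. 1.
# x maps each node to its embedding vector; edges is a list of ordered
# pairs (parent, child).
def update_embeddings(x, edges, F_I, F_O, F_P):
    new_x = {}
    for v, xv in x.items():
        deg = sum(1 for e in edges if v in e)        # d_v in Eqn. 1
        msg = [0.0] * len(xv)
        for (a, b) in edges:
            if b == v:                               # incoming edge (u, v)
                m = F_I(x[a], xv)
            elif a == v:                             # outgoing edge (v, u)
                m = F_O(xv, x[b])
            else:
                continue
            msg = [s + t for s, t in zip(msg, m)]
        avg = [s / max(deg, 1) for s in msg]
        new_x[v] = F_P([a + b for a, b in zip(xv, avg)])
    return new_x

# With identity-like stand-ins, a node's new embedding is its old one plus
# the average message from its neighbors:
x1 = update_embeddings({"q": [1.0], "p": [2.0]}, [("q", "p")],
                       F_I=lambda xu, xv: xu,   # pass the neighbor through
                       F_O=lambda xv, xu: xu,
                       F_P=lambda z: z)
# x1 == {"q": [3.0], "p": [3.0]}
```

In the actual model the three functions are the trained networks of Sec. 4.2, applied to all nodes in parallel.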
The parameters of the update functions and the classifiers are learned end to end through backpropagation.

3.3 Order-Preserving Embeddings

For a function in a formula, the order of its arguments matters. That is, f(x, y) cannot generally be presumed to mean the same as f(y, x). But our embedding update as defined in Eqn. 1 is invariant to the ordering of arguments. Given that the ordering of arguments can be a useful feature for premise selection, we now consider a variant of our basic approach that makes our graph embeddings sensitive to the ordering of arguments. In this variant, we update each node considering the ordering of its incoming and outgoing edges.

Before we define the new update equation, we need to introduce the notion of a treelet. Given a node v in graph G = (V, E), let (v, w) ∈ E be an outgoing edge of v, and let r_v(w) ∈ {1, 2, ...} be the rank of edge (v, w) among all outgoing edges of v. We define a treelet of graph G = (V, E) as a tuple of nodes (u, v, w) ∈ V × V × V such that (1) both (v, u) and (v, w) are edges in the graph and (2) (v, u) is ranked before (v, w) among all outgoing edges of v. In other words, a treelet is a subgraph that consists of a head node v, a left child u, and a right child w. We use T_G to denote all treelets of graph G, that is, T_G = {(u, v, w) : (v, u) ∈ E, (v, w) ∈ E, r_v(u) < r_v(w)}.

Now, when we update a node embedding, we consider not only its direct neighbors, but also its roles in all the treelets it belongs to:

    x_v^{t+1} = F_P^t( x_v^t + (1/d_v) [ Σ_{(u,v)∈E} F_I^t(x_u^t, x_v^t) + Σ_{(v,u)∈E} F_O^t(x_v^t, x_u^t) ]
                             + (1/e_v) [ Σ_{(v,u,w)∈T_G} F_L^t(x_v^t, x_u^t, x_w^t) + Σ_{(u,v,w)∈T_G} F_H^t(x_u^t, x_v^t, x_w^t)
                                       + Σ_{(u,w,v)∈T_G} F_R^t(x_u^t, x_w^t, x_v^t) ] ),    (2)

where e_v = |{(u, v, w) : (u, v, w) ∈ T_G ∨ (v, u, w) ∈ T_G ∨ (u, w, v) ∈ T_G}| is the total number of treelets containing v. In this new update equation, F_L is an update function that considers a treelet where node v is the left child. Similarly, F_H considers a treelet where node v is the head and F_R considers a treelet where node v is the right child. As in Sec. 3.2, the same update functions are applied to all nodes at each step, but across steps the update functions can be different. Fig. 3 shows the update equation for a concrete example.

Our design of Eqn. 2 now allows a node to be embedded differently depending on the ordering of its own arguments and on which argument slot it takes in a parent function. For example, the function node f can now be embedded differently for f(a, b) and f(b, a) because the output of F_H can be different. As another example, in the formula g(f(a), f(a)), there are two function nodes with the same name f, the same parent g, and the same child a, but they can be embedded differently because only F_L will be applied to the f that is the first argument of g and only F_R will be applied to the f that is the second argument of g.

Figure 4: Configurations of the update functions and classifiers: (a) F_P in Eqn. 1 and 2; (b) F_I, F_O in Eqn. 1 and 2, and F_L, F_H, F_R in Eqn. 2; (c) the conditional classifier; (d) the unconditional classifier.

To distinguish the two variants of our approach, we call the method with the treelet update terms FormulaNet, as opposed to the basic variant, FormulaNet-basic, which does not consider edge ordering.

4 Experiments

4.1 Dataset and Evaluation

We evaluate our approach on the HolStep dataset [10], a recently introduced benchmark for evaluating machine learning approaches for theorem proving.
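Before turning to the dataset details, note that the treelet set T_G of Sec. 3.3 can be enumerated directly from each node's ordered outgoing edges. The sketch below is illustrative (our own helper, not the paper's code).

```python
# Illustrative enumeration of the treelet set T_G from Sec. 3.3.
# out_edges maps each node to its children in argument order, i.e. the
# rank r_v of Sec. 3.3 is the position in the list.
def treelets(out_edges):
    ts = []
    for v, children in out_edges.items():
        for i in range(len(children)):
            for j in range(i + 1, len(children)):
                # children[i] is ranked before children[j], so it is the
                # left child of the treelet and children[j] the right child.
                ts.append((children[i], v, children[j]))
    return ts

# f(a, b) and f(b, a) yield different treelets, which is what lets the
# order-preserving update distinguish argument order:
assert treelets({"f": ["a", "b"]}) == [("a", "f", "b")]
assert treelets({"f": ["b", "a"]}) == [("b", "f", "a")]
```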
It was constructed from the proof trace \ufb01les of\nthe HOL Light theorem prover [7] on its multivariate analysis library [48] and the formal proof\nof the Kepler conjecture. The dataset contains 11,410 conjectures, including 9,999 in the training\nset and 1,411 in the test set. Each conjecture is associated with a set of statements, each with a\nground truth label on whether the statement is useful for proving the conjecture. There are 2,209,076\nconjecture-statement pairs in total. We hold out 700 conjectures from the training set as the validation\nset to tune hyperparameters and perform ablation analysis.\nFollowing the evaluation setup proposed in [10], we treat premise selection as a binary classi\ufb01cation\ntask and evaluate classi\ufb01cation accuracy. Also following [10], we evaluate two settings, the condi-\ntional setting where both the conjecture and the statement are given, and the unconditional setting\nwhere the conjecture is ignored. In HolStep, each conjecture is associated with an equal number of\npositive statements and negative statements, so the accuracy of random prediction is 50%.\n\n4.2 Network Con\ufb01gurations\n\nThe initial one-hot vector for each node has 1909 dimensions, representing 1909 unique tokens.\nThese 1909 tokens include 1906 unique constants from the training set and three special tokens,\n\"VAR\", \"VARFUNC\", and \"UNKNOWN\" (representing all novel tokens during testing). We use a\nlinear layer to map one-hot encodings to 256-dimensional vectors. All of the following intermediate\nembeddings are 256-dimensional.\nThe update functions in Eqn. 1 and Eqn. 2 are parametrized as neural networks. Fig. 4 (a), (b) shows\ntheir con\ufb01gurations. 
All update functions are configured the same way: a concatenation of the inputs followed by two fully connected layers, each with a ReLU and batch normalization [49].

The classifier for the conditional setting takes in the embeddings of the conjecture and the statement. Its configuration is shown in Fig. 4 (c). The classifier for the unconditional setting uses only the embedding of the statement; its configuration is shown in Fig. 4 (d).

4.3 Model Training

We train our networks using RMSProp [50] with a learning rate of 0.001 and a weight decay of 1 × 10^-4. We lower the learning rate by 3× after each epoch. We train all models for five epochs; all networks converge after about three or four epochs.

Table 1: Classification accuracy on the test set of our approach versus baseline methods on HolStep in the unconditional setting (conjecture unknown) and the conditional setting (conjecture given).

                 CNN [10]   CNN-LSTM [10]   FormulaNet-basic   FormulaNet
 Unconditional   83         83              89.0               90.0
 Conditional     82         83              89.1               90.3

It is worth noting that there are two levels of batching in our approach: intra-graph batching and inter-graph batching. Intra-graph batching arises from the fact that to embed a graph, each update function (F_P, F_I, F_O, F_L, F_H, F_R in Eqn. 2) is applied to all nodes in parallel. This is the same as training each update function as a standalone network with a batch of input examples. Thus regular batch normalization can be directly applied to the inputs of each update function within a single graph, as shown in Fig.
4(a)(b).

Furthermore, this batch normalization within a graph can be run in training mode even when we are only performing inference to embed a graph, because there are multiple input examples to each update function within a graph. Another level of batching is the regular batching of multiple graphs in training, as is necessary for training the classifier. As usual, batch normalization across graphs is done in evaluation mode at test time.

We also apply intermediate supervision after each step of the embedding update using a separate classifier. For training, our loss function is the sum of the cross-entropy losses from each step. We use the prediction from the last step as our final prediction.

4.4 Main Results

Table 1 compares the accuracy of our approach against the best existing results [10]. Our approach improves the best existing result by a large margin, from 83% to 90.3% in the conditional setting and from 83% to 90.0% in the unconditional setting. We also see that FormulaNet gives a 1% improvement over FormulaNet-basic, validating our hypothesis that the order of function arguments provides useful cues.

Consistent with prior work [10], conditional and unconditional selection have similar performance. This is likely due to the data distribution in HolStep. In the training set, only 0.8% of the statements appear in both a positive statement-conjecture pair and a negative statement-conjecture pair, and the upper performance bound of unconditional selection is 97%. In addition, HolStep contains 9,999 unique conjectures but 1,304,888 unique statements for training, so it is likely easier for the network to learn useful patterns from statements than from conjectures.

We also apply Deepwalk [30], an unsupervised approach for generating node embeddings that is purely based on graph topology, without considering the token associated with each node.
For each\nformula graph, we max-pool its node embeddings and train a classi\ufb01er. The accuracy is 61.8%\n(conditional) and 61.7% (unconditional). This result suggests that for embedding formulas it is\nimportant to use token information and end-to-end supervision.\n\n4.5 Ablation Experiments\n\nInvariance to Variable Renaming One motivation for our graph representation is that the meaning\nof formulas should be invariant to the renaming of variable values and variable functions. To achieve\nsuch invariance, we perform two main transformations of a parse tree to generate a graph: (1) we\nconvert the tree to a graph by linking quanti\ufb01ers and variables, and (2) we discard the variable names.\nWe now study the effect of these steps on the premise selection task. We compare FormulaNet-basic\nwith the following three variants whose only difference is the format of the input graph:\n\n\u2022 Tree-old-names: Use the parse tree as the graph and keep all original names for the nodes.\n\nAn example is the tree in Fig. 2 (b).\n\n\u2022 Tree-renamed: Use the parse tree as the graph but rename all variable values to VAR and\n\nvariable functions to VARFUNC.\n\n\u2022 Graph-old-names: Use the same graph as FormulaNet-basic but keep all original names for\nthe nodes, thus making the graph embedding dependent on the original variable names. An\nexample is the graph in Fig. 
2 (c).

Table 2: The accuracy of FormulaNet-basic and its ablated versions on the original and renamed validation sets.

                       Tree-old-names   Tree-renamed   Graph-old-names   Our Graph
 Original Validation   89.7             84.7           89.8              89.9
 Renamed Validation    82.3             84.7           83.5              89.9

Table 3: Validation accuracy of the proposed models with different numbers of update steps on conditional premise selection.

 Number of steps      0      1      2      3      4
 FormulaNet-basic     81.5   89.3   89.8   89.9   90.0
 FormulaNet           81.5   90.4   91.0   91.1   90.8

We train these variants on the same training set as FormulaNet-basic. To compare with FormulaNet-basic, we evaluate them on the same held-out validation set. In addition, we generate a new validation set (Renamed Validation) by randomly permuting the variable names in the formulas: the textual representation is different but the semantics remains the same. We also compare all models on this renamed validation set to evaluate their robustness to variable renaming.

Table 2 reports the results. If we use a tree with the original names, there is a slight drop when evaluating on the original validation set, but a very large drop when evaluating on the renamed validation set. This shows that there are exploitable features in the original variable names and the model is exploiting them, but the model is essentially overfitting to the bias in the original names and cannot generalize to renamed formulas. The same applies to the model trained on graphs with the original names, whose performance also drops drastically on renamed formulas.

It is also interesting to note that the model trained on renamed trees performs poorly, even though it is invariant to variable renaming.
This shows that the syntactic and semantic information that the graph encodes about variables, particularly their quantifiers and co-references, is important.

Number of Update Steps. An important hyperparameter of our approach is the number of steps used to update the embeddings. Zero steps can only embed a bag of unstructured tokens, while more steps can embed information from larger graph structures. Table 3 compares the accuracy of models with different numbers of update steps. Perhaps surprisingly, models with zero steps can already achieve an accuracy of 81.5%, showing that much of the performance comes from just the names of constant functions and values. More steps lead to notable increases in accuracy, showing that the structures in the graph are important. There is a diminishing return after 3 steps, but this can be reasonably expected because a radius of 3 in a graph is a fairly sizable neighborhood and can encompass reasonably complex expressions: a node can influence its great-grandchildren and great-grandparents. In addition, it is naturally more difficult to learn generalizable features from long-range patterns because they are more varied and each of them occurs much less frequently.

4.6 Visualization of Embeddings

Figure 5: Nearest neighbors of node embeddings after step 1 with FormulaNet. Query nodes are in the first column. The color of each node is coded by a t-SNE [51] projection of its step-0 embedding into 2D. The closer the colors, the nearer two nodes are in the step-0 embedding space.

To qualitatively examine the learned embeddings, we find a set of nodes with similar embeddings and visualize their local structures in Fig. 5.
In each row, we use a node as the query and find the nearest neighbors across all nodes from different graphs. We can see that the nearest neighbors have similar structures in terms of topology and naming. This demonstrates that our graph embeddings can capture syntactic and semantic structures of a formula.

5 Conclusion

In this work, we have proposed a deep learning-based approach to premise selection. We represent a higher-order logic formula as a graph that is invariant to variable renaming but fully preserves syntactic and semantic information. We then embed the graph into a continuous vector through a novel embedding method that preserves the information of edge ordering. Our approach has achieved state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.

Acknowledgements  This work is partially supported by the National Science Foundation under Grant No. 1633157.

References

[1] Alan J. A. Robinson and Andrei Voronkov. Handbook of Automated Reasoning, volume 1. Elsevier, 2001.

[2] Christoph Kern and Mark R. Greenstreet. Formal verification in hardware design: a survey. ACM Transactions on Design Automation of Electronic Systems (TODAES), 4(2):123–193, 1999.

[3] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, et al. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 207–220. ACM, 2009.

[4] Xavier Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009.

[5] Alexander A. Alemi, Francois Chollet, Niklas Een, Geoffrey Irving, Christian Szegedy, and Josef Urban. DeepMath: deep sequence models for premise selection. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2235–2243. Curran Associates, Inc., 2016.

[6] Adam Naumowicz and Artur Korniłowicz. A Brief Overview of Mizar, pages 67–72. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[7] John Harrison. HOL Light: An Overview, pages 60–66. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[8] Kryštof Hoder and Andrei Voronkov. Sine qua non for large theory reasoning. In International Conference on Automated Deduction, pages 299–314. Springer, 2011.

[9] Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze, and Josef Urban. Premise selection for mathematics by corpus analysis and kernel methods. Journal of Automated Reasoning, 52(2):191–213, 2014.

[10] Cezary Kaliszyk, François Chollet, and Christian Szegedy. HolStep: A machine learning dataset for higher-order logic theorem proving. arXiv preprint arXiv:1703.00426, 2017.

[11] John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving. In Computational Logic, volume 9, pages 135–214, 2014.

[12] Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In International Conference on Computer Aided Verification, pages 1–35. Springer, 2013.

[13] Stephan Schulz. E: a brainiac theorem prover. AI Communications, 15(2-3):111–126, 2002.

[14] Gilles Dowek, Amy Felty, Hugo Herbelin, Gérard Huet, Chetan Murthy, Catherine Parent, Christine Paulin-Mohring, and Benjamin Werner. The Coq Proof Assistant: User's Guide: Version 5.6. INRIA, 1992.

[15] Makarius Wenzel, Lawrence C. Paulson, and Tobias Nipkow. The Isabelle framework. In International Conference on Theorem Proving in Higher Order Logics, pages 33–38. Springer, 2008.

[16] Thomas Hales, Mark Adams, Gertrud Bauer, Dat Tat Dang, John Harrison, Truong Le Hoang, Cezary Kaliszyk, Victor Magron, Sean McLaughlin, Thang Tat Nguyen, et al. A formal proof of the Kepler conjecture. arXiv preprint arXiv:1501.02155, 2015.

[17] Georges Gonthier, Andrea Asperti, Jeremy Avigad, Yves Bertot, Cyril Cohen, François Garillot, Stéphane Le Roux, Assia Mahboubi, Russell O'Connor, Sidi Ould Biha, Ioana Pasca, Laurence Rideau, Alexey Solovyev, Enrico Tassi, and Laurent Théry. A Machine-Checked Proof of the Odd Order Theorem, pages 163–179. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[18] Christian Suttner and Wolfgang Ertel. Automatic acquisition of search guiding heuristics. In International Conference on Automated Deduction, pages 470–484. Springer, 1990.

[19] Jörg Denzinger, Matthias Fuchs, Christoph Goller, and Stephan Schulz. Learning from previous proof experience: A survey. Citeseer, 1999.

[20] S. A. Schulz. Learning Search Control Knowledge for Equational Deduction, volume 230. IOS Press, 2000.

[21] Michael Färber and Chad Brown. Internal guidance for Satallax. In International Joint Conference on Automated Reasoning, pages 349–361. Springer, 2016.

[22] Cezary Kaliszyk and Josef Urban. FEMaLeCoP: fairly efficient machine learning connection prover. In Logic for Programming, Artificial Intelligence, and Reasoning, pages 88–96. Springer, 2015.

[23] Daniel Whalen. Holophrasm: a neural automated theorem prover for higher-order logic. arXiv preprint arXiv:1608.02644, 2016.

[24] Jan Jakubův and Josef Urban. ENIGMA: Efficient learning-based inference guiding machine. arXiv preprint arXiv:1701.06532, 2017.

[25] Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017.

[26] Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 801–809. Curran Associates, Inc., 2011.

[27] Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.

[28] Daniel Kühlwein and Josef Urban. MaLeS: A framework for automatic tuning of automated theorem provers. Journal of Automated Reasoning, 55(2):91–116, 2015.

[29] James P. Bridge, Sean B. Holden, and Lawrence C. Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, 53(2):141–172, 2014.

[30] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[31] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. ACM, 2015.

[32] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[34] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[35] Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks, volume 1, pages 347–352. IEEE, 1996.

[36] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[37] Dipendra Kumar Misra and Yoav Artzi. Neural shift-reduce CCG semantic parsing. In EMNLP, pages 1775–1786, 2016.

[38] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05), volume 2, pages 729–734. IEEE, 2005.

[39] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[40] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[41] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[42] Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.

[43] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[44] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3837–3845, 2016.

[45] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[46] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning. ACM, 2016.

[47] Alonzo Church. A formulation of the simple theory of types. The Journal of Symbolic Logic, 5(2):56–68, 1940.

[48] John Harrison. The HOL Light theory of Euclidean space. Journal of Automated Reasoning, pages 1–18, 2013.

[49] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[50] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, 2012.

[51] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.