{"title": "Learning to Discover Efficient Mathematical Identities", "book": "Advances in Neural Information Processing Systems", "page_first": 1278, "page_last": 1286, "abstract": "In this paper we explore how machine learning techniques can be applied to the discovery of efficient mathematical identities. We introduce an attribute grammar framework for representing symbolic expressions. Given a grammar of math operators, we build trees that combine them in different ways, looking for compositions that are analytically equivalent to a target expression but of lower computational complexity. However, as the space of trees grows exponentially with the complexity of the target expression, brute force search is impractical for all but the simplest of expressions. Consequently, we introduce two novel learning approaches that are able to learn from simpler expressions to guide the tree search. The first of these is a simple n-gram model, the other being a recursive neural-network. We show how these approaches enable us to derive complex identities, beyond reach of brute-force search, or human derivation.", "full_text": "Learning to Discover\n\nEf\ufb01cient Mathematical Identities\n\nWojciech Zaremba\n\nDept. of Computer Science\n\nCourant Institute\n\nNew York Unviersity\n\nKarol Kurach\nGoogle Zurich &\n\nDept. of Computer Science\n\nUniversity of Warsaw\n\nRob Fergus\n\nDept. of Computer Science\n\nCourant Institute\n\nNew York Unviersity\n\nAbstract\n\nIn this paper we explore how machine learning techniques can be applied to the\ndiscovery of ef\ufb01cient mathematical identities. We introduce an attribute gram-\nmar framework for representing symbolic expressions. Given a grammar of math\noperators, we build trees that combine them in different ways, looking for compo-\nsitions that are analytically equivalent to a target expression but of lower compu-\ntational complexity. 
However, as the space of trees grows exponentially with the\ncomplexity of the target expression, brute force search is impractical for all but\nthe simplest of expressions. Consequently, we introduce two novel learning ap-\nproaches that are able to learn from simpler expressions to guide the tree search.\nThe \ufb01rst of these is a simple n-gram model, the other being a recursive neural-\nnetwork. We show how these approaches enable us to derive complex identities,\nbeyond reach of brute-force search, or human derivation.\n\nIntroduction\n\n1\nMachine learning approaches have proven highly effective for statistical pattern recognition prob-\nlems, such as those encountered in speech or vision. However, their use in symbolic settings has\nbeen limited. In this paper, we explore how learning can be applied to the discovery of mathematical\nidentities. Speci\ufb01cally, we propose methods for \ufb01nding computationally ef\ufb01cient versions of a given\ntarget expression. That is, \ufb01nding a new expression which computes an identical result to the target,\nbut has a lower complexity (in time and/or space).\nWe introduce a framework based on attribute grammars [14] that allows symbolic expressions to be\nexpressed as a sequence of grammar rules. Brute-force enumeration of all valid rule combinations\nallows us to discover ef\ufb01cient versions of the target, including those too intricate to be discovered by\nhuman manipulation. But for complex target expressions this strategy quickly becomes intractable,\ndue to the exponential number of combinations that must be explored. In practice, a random search\nwithin the grammar tree is used to avoid memory problems, but the chance of \ufb01nding a matching\nsolution becomes vanishingly small for complex targets.\nTo overcome this limitation, we use machine learning to produce a search strategy for the grammar\ntrees that selectively explores branches likely (under the model) to yield a solution. 
The training\ndata for the model comes from solutions discovered for simpler target expressions. We investigate\nseveral different learning approaches. The \ufb01rst group are n-gram models, which learn pairs, triples,\netc. of expressions that were part of previously discovered solutions and thus, hopefully, might be part\nof the solution for the current target. We also train a recursive neural network (RNN) that operates\nwithin the grammar trees. This model is \ufb01rst pretrained to learn a continuous representation for\nsymbolic expressions. Then, using this representation we learn to predict the next grammar rule to\nadd to the current expression to yield an ef\ufb01cient version of the target.\nThrough the use of learning, we are able to dramatically widen the complexity and scope of expres-\nsions that can be handled in our framework. We show examples of (i) O(n^3) target expressions\nwhich can be computed in O(n^2) time (e.g. see Examples 1 & 2), and (ii) cases where naive\nevaluation of the target would require exponential time, but can be computed in O(n^2) or O(n^3) time.\n\n1\n\n\fThe majority of these examples are too complex to be found manually or by exhaustive search and,\nas far as we are aware, are previously undiscovered. 
All code and evaluation data can be found at\nhttps://github.com/kkurach/math_learning.\nIn summary, our contributions are:\n\n\u2022 A novel grammar framework for \ufb01nding ef\ufb01cient versions of symbolic expressions.\n\u2022 Showing how machine learning techniques can be integrated into this framework, and\ndemonstrating how training models on simpler expressions can help with the discovery\nof more complex ones.\n\u2022 A novel application of a recursive neural-network to learn a continuous representation of\nmathematical structures, making the symbolic domain accessible to many other learning\napproaches.\n\u2022 The discovery of many new mathematical identities which offer a signi\ufb01cant reduction in\ncomputational complexity for certain expressions.\n\nExample 1: Assume we are given matrices A \u2208 Rn\u00d7m, B \u2208 Rm\u00d7p. We wish to compute the\ntarget expression: sum(sum(A*B)), i.e. \\sum_{n,p} AB = \\sum_{i=1}^{n} \\sum_{j=1}^{m} \\sum_{k=1}^{p} A_{i,j} B_{j,k}, which\nnaively takes O(nmp) time. Our framework is able to discover an ef\ufb01cient version of the\nformula, that computes the same result in O(n(m + p)) time: sum((sum(A, 1) * B)\u2019, 1).\nOur framework builds grammar trees that explore valid compositions of expressions from the\ngrammar, using a search strategy. In this example, the naive strategy of randomly choosing\npermissible rules suf\ufb01ces and we can \ufb01nd another tree which matches the target expression in\nreasonable time. Below, we show trees for (i) the original expression and (ii) the ef\ufb01cient\nformula which avoids the use of a matrix-matrix multiply operation, hence is ef\ufb01cient to\ncompute.\n\n\u2014\u2014\u2014\u2014\u2014\u2014\u2014\nExample 2: Consider the target expression: sum(sum((A*B)^k)), where k = 6. For an\nexpression of this degree, there are 9785 possible grammar trees and the naive strategy used in\nExample 1 breaks down. 
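As a sanity check, the Example 1 identity can be verified numerically. Below is a minimal Python sketch of our own (the paper itself works with Matlab-style expressions; helper names such as `matmul` and `col_sums` are ours):

```python
import random

def matmul(X, Y):
    # Naive matrix-matrix product: O(n*m*p) scalar multiplications.
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][j] * Y[j][k] for j in range(m)) for k in range(p)]
            for i in range(n)]

def col_sums(X):
    # Matlab's sum(X, 1): column sums, a 1 x m row vector.
    return [sum(row[j] for row in X) for j in range(len(X[0]))]

random.seed(0)
n, m, p = 4, 5, 6
A = [[random.randint(-9, 9) for _ in range(m)] for _ in range(n)]
B = [[random.randint(-9, 9) for _ in range(p)] for _ in range(m)]

# Naive target sum(sum(A*B)): O(nmp).
naive = sum(sum(row) for row in matmul(A, B))

# Discovered form sum((sum(A,1) * B)', 1): O(n*m + m*p).
s = col_sums(A)  # 1 x m
fast = sum(sum(s[j] * B[j][k] for j in range(m)) for k in range(p))

assert naive == fast
```

Both sides agree exactly on random integer matrices, and the second form never materializes the n x p product.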
We therefore learn a search strategy, training a model on successful\ntrees from simpler expressions, such as those for k = 2, 3, 4, 5. Our learning approaches capture\nthe common structure within the solutions, evident below, so can \ufb01nd an ef\ufb01cient O (nm)\nexpression for this target:\nk = 2: sum((((((sum(A, 1)) * B) * A) * B)\u2019), 1)\nk = 3: sum((((((((sum(A, 1)) * B) * A) * B) * A) * B)\u2019), 1)\nk = 4: sum((((((((((sum(A, 1)) * B) * A) * B) * A) * B) * A) * B)\u2019), 1)\nk = 5: sum((((((((((((sum(A, 1)) * B) * A) * B) * A) * B) * A) * B) * A) * B)\u2019), 1)\nk = 6: sum(((((((((((((sum(A, 1) * B) * A) * B) *A) * B) * A) * B)* A) * B) * A) * B)\u2019), 1)\n\n1.1 Related work\nThe problem addressed in this paper overlaps with the areas of theorem proving [5, 9, 11], program\ninduction [18, 28] and probabilistic programming [12, 20]. These domains involve the challenging\nissues of undecidability, the halting problem, and a massive space of potential computation. How-\never, we limit our domain to computation of polynomials with \ufb01xed degree k, where undecidability\nand the halting problem are not present, and the space of computation is manageable (i.e. it grows\nexponentially, but not super-exponentially). Symbolic computation engines, such as Maple [6] and\nMathematica [27] are capable of simplifying expressions by collecting terms but do not explicitly\nseek versions of lower complexity. Furthermore, these systems are rule based and do not use learn-\ning approaches, the major focus of this paper. In general, there has been very little exploration of\nstatistical machine learning techniques in these \ufb01elds, one of the few attempts being the recent work\nof Bridge et al. [4] who use learning to select between different heuristics for 1st order reasoning. 
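The common structure in the Example 2 solutions above can also be checked directly. A hedged Python sketch of ours for the k = 2 case, comparing the target sum(sum((A*B)^2)) against the discovered row-vector form:

```python
import random

def matmul(X, Y):
    # Naive dense product of two list-of-lists matrices.
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][j] * Y[j][k] for j in range(m)) for k in range(p)]
            for i in range(n)]

def total(X):
    # sum(sum(X)): the sum of all entries.
    return sum(sum(row) for row in X)

random.seed(1)
n, m = 4, 5
A = [[random.randint(-5, 5) for _ in range(m)] for _ in range(n)]
B = [[random.randint(-5, 5) for _ in range(n)] for _ in range(m)]

# Naive target for k = 2: sum(sum((A*B)^2)), via two cubic-cost products.
AB = matmul(A, B)
naive = total(matmul(AB, AB))

# Discovered k = 2 form: sum((((((sum(A,1)) * B) * A) * B)'), 1),
# which only ever multiplies a row vector by a matrix.
v = [[sum(A[i][j] for i in range(n)) for j in range(m)]]  # sum(A, 1): 1 x m
for M in (B, A, B):
    v = matmul(v, M)
fast = total(v)

assert naive == fast
```

The equivalence is just associativity: summing all entries of (AB)(AB) equals the row vector 1'A pushed through B, A, B and then summed.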
In\ncontrast, our approach does not use hand-designed heuristics, instead learning them automatically\nfrom the results of simpler expressions.\n\n2\n\n\fRule | Input | Output | Computation | Complexity\nMatrix-matrix multiply | X \u2208 Rn\u00d7m, Y \u2208 Rm\u00d7p | Z \u2208 Rn\u00d7p | Z = X * Y | O(nmp)\nMatrix-element multiply | X \u2208 Rn\u00d7m, Y \u2208 Rn\u00d7m | Z \u2208 Rn\u00d7m | Z = X .* Y | O(nm)\nMatrix-vector multiply | X \u2208 Rn\u00d7m, Y \u2208 Rm\u00d71 | Z \u2208 Rn\u00d71 | Z = X * Y | O(nm)\nMatrix transpose | X \u2208 Rn\u00d7m | Z \u2208 Rm\u00d7n | Z = X^T | O(nm)\nColumn sum | X \u2208 Rn\u00d7m | Z \u2208 R1\u00d7m | Z = sum(X,1) | O(nm)\nRow sum | X \u2208 Rn\u00d7m | Z \u2208 Rn\u00d71 | Z = sum(X,2) | O(nm)\nColumn repeat | X \u2208 Rn\u00d71 | Z \u2208 Rn\u00d7m | Z = repmat(X,1,m) | O(nm)\nRow repeat | X \u2208 R1\u00d7m | Z \u2208 Rn\u00d7m | Z = repmat(X,n,1) | O(nm)\nElement repeat | X \u2208 R1\u00d71 | Z \u2208 Rn\u00d7m | Z = repmat(X,n,m) | O(nm)\n\nTable 1: The grammar G used in our experiments.\n\nThe attribute grammar, originally developed in 1968 by Knuth [14] in the context of compiler construction,\nhas been successfully used as a tool for design and formal speci\ufb01cation. In our work, we\napply attribute grammars to a search and optimization problem. This has previously been explored\nin a range of domains: from well-known algorithmic problems like knapsack packing [19], through\nbioinformatics [26] to music [10]. However, we are not aware of any previous work related to discovering\nmathematical formulas using grammars, and learning in such a framework. The closest work\nto ours can be found in [7], which involves searching over the space of algorithms and where the grammar\nattributes also represent computational complexity.\nClassical techniques in natural language processing make extensive use of grammars, for example\nto parse sentences and translate between languages. 
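To make Table 1 concrete, the grammar could be encoded as data, with each rule carrying its arity and its complexity attribute. This is a hypothetical sketch of ours, not the authors' implementation:

```python
# Hypothetical encoding of Table 1: (rule name, arity, complexity attribute).
# The complexity attribute is a function of the symbolic dimensions n, m, p.
GRAMMAR = [
    ("matrix-matrix multiply",  2, lambda n, m, p: n * m * p),  # Z = X * Y
    ("matrix-element multiply", 2, lambda n, m, p: n * m),      # Z = X .* Y
    ("matrix-vector multiply",  2, lambda n, m, p: n * m),      # Z = X * Y
    ("matrix transpose",        1, lambda n, m, p: n * m),      # Z = X^T
    ("column sum",              1, lambda n, m, p: n * m),      # Z = sum(X, 1)
    ("row sum",                 1, lambda n, m, p: n * m),      # Z = sum(X, 2)
    ("column repeat",           1, lambda n, m, p: n * m),      # Z = repmat(X, 1, m)
    ("row repeat",              1, lambda n, m, p: n * m),      # Z = repmat(X, n, 1)
    ("element repeat",          1, lambda n, m, p: n * m),      # Z = repmat(X, n, m)
]

# The attribute of a whole tree is the sum over the rules it uses; e.g.
# one matrix-matrix multiply followed by a column sum, with n = m = p = 100:
rules = dict((name, c) for name, _, c in GRAMMAR)
cost = rules["matrix-matrix multiply"](100, 100, 100) + rules["column sum"](100, 100, 100)
assert cost == 100**3 + 100**2
```

Representing the attribute as a function of symbolic dimensions is what lets a search compare candidate trees by asymptotic cost before any evaluation.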
In this paper, we borrow techniques from NLP\nand apply them to symbolic computation. In particular, we make use of an n-gram model over\nmathematical operations, inspired by n-gram language models. Recursive neural networks have\nalso been recently used in NLP, for example by Luong et al. [15] and Socher et al. [22, 23], as well\nas for generic knowledge representation by Bottou [2]. In particular, Socher et al. [23] apply them to parse\ntrees for sentiment analysis. By contrast, we apply them to trees of symbolic expressions. Our work\nalso has similarities to Bowman [3], who shows that a recursive network can learn simple logical\npredicates.\nOur demonstration of continuous embeddings for symbolic expressions has parallels with the embeddings\nused in NLP for words and sentence structure, for example, Collobert & Weston [8], Mnih\n& Hinton [17], Turian et al. [25] and Mikolov et al. [16].\n2 Problem Statement\nProblem De\ufb01nition: We are given a symbolic target expression T that combines a set of variables V\nto produce an output O, i.e. O = T(V). We seek an alternate expression S, such that S(V) = T(V)\nbut with lower computational complexity, i.e. O(S) < O(T).\nIn this paper we consider the restricted setting where: (i) T is a homogeneous polynomial of degree\nk\u2217, (ii) V contains a single matrix or vector A and (iii) O is a scalar. While these assumptions may\nseem quite restrictive, they still permit a rich family of expressions for our algorithm to explore.\nFor example, by combining multiple polynomial terms, an ef\ufb01cient Taylor series approximation\ncan be found for expressions involving trigonometric or exponential operators. Regarding (ii), our\nframework can easily handle multiple variables, e.g. Figure 1, which shows expressions using two\nmatrices, A and B. However, the rest of the paper considers targets based on a single variable. 
In\nSection 8, we discuss these restrictions further.\nNotation: We adopt Matlab-style syntax for expressions.\n3 Attribute Grammar\nWe \ufb01rst de\ufb01ne an attribute grammar G, which contains a set of mathematical operations, each with\nan associated complexity (the attribute). Since T contains exclusively polynomials, we use the\ngrammar rules listed in Table 1.\nUsing these rules we can develop trees that combine rules to form expressions involving V, which\nfor the purposes of this paper is a single matrix A. Since we know T involves expressions of degree\nk, each tree must use A exactly k times.\n\n\u2217I.e. it only contains terms of degree k. E.g. ab + a^2 + ac is a homogeneous polynomial of degree 2, but\na^2 + b is not homogeneous (b is of degree 1, but a^2 is of degree 2).\n\n3\n\n\fFurthermore, since the output is a scalar, each tree must\nalso compute a scalar quantity. These two constraints limit the depth of each tree. For some targets\nT whose complexity is only O(n^3), we remove the matrix-matrix multiply rule, thus ensuring\nthat if any solution is found its complexity is at most O(n^2) (see Section 7.2 for more details).\nExamples of trees are shown in Fig. 1. The search strategy for determining which rules to combine\nis addressed in Section 6.\n4 Representation of Symbolic Expressions\nWe need an ef\ufb01cient way to check if the expression produced by a given tree, or combination of trees\n(see Section 5), matches T. The conventional approach would be to perform this check symbolically,\nbut this is too slow for our purposes and is not amenable to integration with learning methods. We\ntherefore explore two alternate approaches.\n4.1 Numerical Representation\nIn this representation, each expression is represented by its evaluation on a randomly drawn set of\nN points, where N is large (typically 1000). More precisely, for each variable in V, N different\ncopies are made, each populated with randomly drawn elements. 
The target expression is evaluated on\neach of these copies, producing a scalar value for each, so yielding a vector t of length N which\nuniquely characterizes T. Formally, tn = T(Vn). We call this numerical vector t the descriptor\nof the symbolic expression T. The size of the descriptor, N, must be suf\ufb01ciently large to ensure\nthat different expressions are not mapped to the same descriptor. Furthermore, when the descriptors\nare used in the linear system of Eqn. 5 below, N must also be greater than the number of linear\nequations. Any expression S formed by the grammar can be used to evaluate each Vn to produce\nanother N-length descriptor vector s, which can then be compared to t. If the two match, then\nS(V) = T(V).\nIn practice, using \ufb02oating point values can result in numerical issues that prevent t and s matching,\neven if the two expressions are equivalent. We therefore use an integer-based descriptor in the form\nof Zp\u2020, where p is a large prime number. This prevents both rounding issues as well as numerical\nover\ufb02ow.\n4.2 Learned Representation\nWe now consider how to learn a continuous representation for symbolic expressions, that is, learn a\nprojection \u03c6 which maps expressions S to l-dimensional vectors: \u03c6(S) \u2192 Rl. We use a recursive\nneural network (RNN) to do this, in a similar fashion to Socher et al. [23] for natural language\nand Bowman et al. [3] for logical expressions. This potentially allows many symbolic tasks to be\nperformed by machine learning techniques, in the same way that word-vectors (e.g. [8] and [16])\nenable many NLP tasks to be posed as learning problems.\nWe \ufb01rst create a dataset of symbolic expressions, spanning the space of all valid expressions up to\ndegree k. We then group them into clusters of equivalent expressions (using the numerical representation\nto check for equality), and give each cluster a discrete label 1 . . . C. For example, A and (AT)T\nmight have label 1, and \\sum_i \\sum_j Ai,j and \\sum_j \\sum_i Ai,j might have label 2, and so on. 
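A sketch of the Section 4.1 descriptor in Python, applied to the kind of equivalence check used to build these clusters. This is our own illustration; the specific prime and the helper names are assumptions, since the paper only requires p to be a large prime:

```python
import random

P = 2_147_483_647  # a large prime; the particular choice is ours, not the paper's

def descriptor(expr, copies, p=P):
    # Evaluate the expression on each random copy of the variable, modulo p.
    # The resulting N-vector is the expression's descriptor.
    return tuple(expr(A) % p for A in copies)

random.seed(2)
N, n = 100, 3
copies = [[[random.randrange(P) for _ in range(n)] for _ in range(n)]
          for _ in range(N)]

# Two syntactically different but equivalent expressions (cf. the classes
# above): sum of all entries, computed row-wise vs. column-wise.
t = descriptor(lambda A: sum(sum(row) for row in A), copies)
s = descriptor(lambda A: sum(A[i][j] for j in range(n) for i in range(n)), copies)

assert s == t  # equal descriptors: the expressions agree on every copy
```

Because the arithmetic is over integers modulo a prime, there is no rounding and no overflow, exactly the failure modes the integer-based descriptor is there to avoid.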
For k = 6, the\ndataset consists of C = 1687 classes, examples of which are shown in Fig. 1. Each class is split\n80/20 into train/test sets.\nWe then train a recursive neural network (RNN) to classify a grammar tree into one of the C clusters.\nInstead of representing each grammar rule by its underlying arithmetic, we parameterize it by a\nweight matrix or tensor (for operations with one or two inputs, respectively) and use this to learn\nthe concept of each operation, as part of the network. A vector a \u2208 Rl, where l = 30\u2021, is used\nto represent each input variable. Working along the grammar tree, each operation in S evolves this\nvector via matrix/tensor multiplications (preserving its length) until the entire expression is parsed,\nresulting in a single vector \u03c6(S) of length l, which is passed to the classi\ufb01er to determine the class\nof the expression, and hence which other expressions it is equivalent to.\nFig. 2 shows this procedure for two different expressions. Consider the \ufb01rst expression S = (A .\u2217\nA)\u2019 \u2217 sum(A, 2). The \ufb01rst operation here is .\u2217, which is implemented in the RNN by taking the\n\n\u2020Integers modulo p\n\u2021This was selected by cross-validation to control the capacity of the RNN, since it directly controls the\nnumber of parameters in the model.\n\n4\n\n\ftwo (identical) vectors a and applying a weight tensor W3 (of size l \u00d7 l \u00d7 l, so that the output is\nalso size l), followed by a recti\ufb01ed-linear non-linearity. The output of this stage is thus max((W3 \u2217\na) \u2217 a, 0). This vector is presented to the next operation, a matrix transpose, whose output is thus\nmax(W2 \u2217 max((W3 \u2217 a) \u2217 a, 0), 0). Applying the remaining operations produces a \ufb01nal output:\n\u03c6(S) = max((W4 \u2217 max(W2 \u2217 max((W3 \u2217 a) \u2217 a, 0), 0)) \u2217 max(W1 \u2217 a, 0), 0). 
This is presented to a\nC-way softmax classi\ufb01er to predict the class of the expression. The weights W are trained using a\ncross-entropy loss and backpropagation.\n\n(((sum((sum((A * (A\u2019)), 1)), 2)) * ((A * (((sum((A\u2019), 1)) * A)\u2019))\u2019)) * A)\n(sum(((sum((A * (A\u2019)), 2)) * ((sum((A\u2019), 1)) * (A * ((A\u2019) * A)))), 1))\n(((sum(A, 1)) * (((sum(A, 2)) * (sum(A, 1)))\u2019)) * (A * ((A\u2019) * A)))\n((((sum((sum((A * (A\u2019)), 1)), 2)) * ((sum((A\u2019), 1)) * (A * ((A\u2019) * A))))\u2019)\u2019)\n((sum(A, 1)) * (((A\u2019) * (A * ((A\u2019) * ((sum(A, 2)) * (sum(A, 1))))))\u2019))\n((sum((sum((A * (A\u2019)), 1)), 2)) * ((sum((A\u2019), 1)) * (A * ((A\u2019) * A))))\n(((sum((sum((A * (A\u2019)), 1)), 2)) * ((sum((A\u2019), 1)) * A)) * ((A\u2019) * A))\n\n((A\u2019) * ((sum(A, 2)) * ((sum((A\u2019), 1)) * (A * (((sum((A\u2019), 1)) * A)\u2019)))))\n(sum(((A\u2019) * ((sum(A, 2)) * ((sum((A\u2019), 1)) * (A * ((A\u2019) * A))))), 2))\n((((sum(A, 2)) * ((sum((A\u2019), 1)) * A))\u2019) * (A * (((sum((A\u2019), 1)) * A)\u2019)))\n(((sum((A\u2019), 1)) * (A * ((A\u2019) * ((sum(A, 2)) * ((sum((A\u2019), 1)) * A)))))\u2019)\n((((sum((A\u2019), 1)) * A)\u2019) * ((sum((A\u2019), 1)) * (A * (((sum((A\u2019), 1)) * A)\u2019))))\n(((A * ((A\u2019) * ((sum(A, 2)) * ((sum((A\u2019), 1)) * A))))\u2019) * (sum(A, 2)))\n(((A\u2019) * ((sum(A, 2)) * ((sum((A\u2019), 1)) * A))) * (sum(((A\u2019) * A), 2)))\n\n(a) Class A\n\n(b) Class B\n\nFigure 1: Samples from two classes of degree k = 6 in our dataset of expressions, used to learn\na continuous representation of symbolic expressions via an RNN. Each line represents a different\nexpression, but those in the same class are equivalent to one another.\n\n(a) (A .\u2217 A)\u2019 \u2217 sum(A, 2)\n\n(b) (A\u2019 .\u2217 A\u2019) \u2217 sum(A, 2)\n\nFigure 2: Our RNN applied to two expressions. The matrix A is represented by a \ufb01xed random\nvector a (of length l = 30). 
Each operation in the expression applies a different matrix (for single\ninput operations) or tensor (for dual inputs, e.g. matrix-element multiplication) to this vector. After\neach operation, a recti\ufb01ed-linear non-linearity is applied. The weight matrices/tensors for each\noperation are shared across different expressions. The \ufb01nal vector is passed to a softmax classi\ufb01er\n(not shown) to predict which class they belong to. In this example, both expressions are equivalent,\nthus should be mapped to the same class.\n\nWhen training the RNN, there are several important details that are crucial to obtaining high classi-\n\ufb01cation accuracy:\n\n\u2022 The weights should be initialized to the identity, plus a small amount of Gaussian noise\nadded to all elements. The identity allows information to \ufb02ow the full length of the network,\nup to the classi\ufb01er regardless of its depth [21]. Without this, the RNN over\ufb01ts badly,\nproducing test accuracies of \u223c 1%.\n\u2022 Recti\ufb01ed linear units work much better in this setting than tanh activation functions.\n\u2022 We learn using a curriculum [1], starting with the simplest expressions of low degree and\nslowly increasing k.\n\u2022 The weight matrix in the softmax classi\ufb01er has a much larger (\u00d7100) learning rate than the\nrest of the layers. This encourages the representation to stay still even when targets are\nreplaced, for example, as we move to harder examples.\n\u2022 As well as updating the weights of the RNN, we also update the initial value of a (i.e. we\nbackpropagate to the input also).\n\nWhen the RNN-based representation is employed for identity discovery (see Section 6.3), the vector\n\u03c6(S) is used directly (i.e. the C-way softmax used in training is removed from the network).\n5 Linear Combinations of Trees\nFor simple targets, an expression that matches the target may be contained within a single grammar\ntree. 
But more complex expressions typically require a linear combination of expressions from\ndifferent trees.\n\n5\n\n\fTo handle this, we can use the integer-based descriptors for each tree in a linear system and solve\nfor a match to the target descriptor (if one exists). Given a set of M trees, each with its own integer\ndescriptor vector f, we form an M by N linear system of equations and solve it:\n\nF w = t (mod p)\n\nwhere F = [f1, . . . , fM ] holds the tree representations, w is the weighting on each of the trees\nand t is the target representation. The system is solved using Gaussian elimination, where addition\nand multiplication are performed modulo p. The number of solutions can vary: (a) there can be no\nsolution, which means that no linear combination of the current set of trees can match the target\nexpression. If all possible trees have been enumerated, then this implies the target expression is\noutside the scope of the grammar. (b) There can be one or more solutions, meaning that some\ncombination of the current set of trees yields a match to the target expression.\n6 Search Strategy\nSo far, we have proposed a grammar which de\ufb01nes the computations that are permitted (like a\nprogramming language grammar), but it gives no guidance as to how to explore the space of possible\nexpressions. Neither do the representations we introduced help \u2013 they simply allow us to determine\nif an expression matches or not. We now describe how to ef\ufb01ciently explore the space by learning\nwhich paths are likely to yield a match.\nOur framework uses two components: a scheduler, and a strategy. The scheduler is \ufb01xed, and\ntraverses the space of expressions according to recommendations given by the selected strategy (e.g. \u201cRandom\u201d or \u201cn-gram\u201d or \u201cRNN\u201d). The strategy assesses which of the possible grammar rules is likely\nto lead to a solution, given the current expression. 
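The solve step of Section 5 amounts to textbook Gaussian elimination over the finite field Z_p, where the modular inverse of a pivot comes from Fermat's little theorem. A self-contained Python sketch of ours, not the paper's code:

```python
def solve_mod_p(F, t, p):
    """Solve F w = t (mod p) by Gaussian elimination over the field Z_p.
    F is given as a list of rows (one per equation, one column per tree);
    returns one solution w, or None if the system is inconsistent."""
    rows = [[x % p for x in row] + [b % p] for row, b in zip(F, t)]
    n_eq, n_var = len(rows), len(F[0])
    pivots, r = [], 0
    for c in range(n_var):
        piv = next((i for i in range(r, n_eq) if rows[i][c]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        inv = pow(rows[r][c], p - 2, p)          # modular inverse (p prime)
        rows[r] = [x * inv % p for x in rows[r]]
        for i in range(n_eq):
            if i != r and rows[i][c]:
                f = rows[i][c]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    if any(row[-1] for row in rows[r:]):         # 0 = nonzero: inconsistent
        return None
    w = [0] * n_var
    for i, c in enumerate(pivots):
        w[c] = rows[i][-1]
    return w

# Tiny check: two tree descriptors (columns), target = tree0 + 2*tree1 (mod 7).
F = [[1, 3], [2, 1], [4, 5]]
t = [(1 + 2 * 3) % 7, (2 + 2 * 1) % 7, (4 + 2 * 5) % 7]
assert solve_mod_p(F, t, 7) == [1, 2]
assert solve_mod_p([[1], [1]], [0, 1], 7) is None   # inconsistent system
```

When the system is underdetermined, this returns one solution with free variables set to zero, which is all the search needs: any consistent weighting of trees yields a match.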
Starting with the variables V (in our case a single\nelement A, or more generally, the elements A, B etc.), at each step the scheduler receives scores\nfor each rule from the strategy and picks the one with the highest score. This continues until the\nexpression reaches degree k and the tree is complete. We then run the linear solver to see if a linear\ncombination of the existing set of trees matches the target. If not, the scheduler starts again with\na new tree, initialized with the set of variables V. The n-gram and RNN strategies are learned in\nan incremental fashion, starting with simple target expressions (i.e. those of low degree k, such as\n\\sum_{i,j} AA^T). Once solutions to these are found, they become training examples used to improve the\nstrategy, needed for tackling harder targets (e.g. \\sum_{i,j} AA^T A).\n\n6.1 Random Strategy\nThe random strategy involves no learning, thus assigns equal scores to all valid grammar rules,\nhence the scheduler randomly picks which expression to try at each step. For simple targets, this\nstrategy may succeed as the scheduler may stumble upon a match to the target within a reasonable\ntime-frame. But for complex target expressions of high degree k, the search space is huge and the\napproach fails.\n6.2 n-gram\nIn this strategy, we simply count how often subtrees of depth n occur in solutions to previously\nsolved targets. As the number of different subtrees of depth n is large, the counts become very\nsparse as n grows. Due to this, we use a weighted linear combination of the score from all depths\nup to n. We found an effective weighting to be 10^k, where k is the depth of the tree.\n6.3 Recursive Neural Network\nSection 4.2 showed how to use an RNN to learn a continuous representation of grammar trees. Recall\nthat the RNN \u03c6 maps expressions to continuous vectors: \u03c6(S) \u2192 Rl. 
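One possible reading of the n-gram strategy of Section 6.2 in code, with toy trees echoing the Example 2 solutions. The operator shorthands and the exact subtree bookkeeping are our assumptions, not the paper's implementation:

```python
from collections import Counter

def clip(t, d):
    # Truncate a tree (nested tuples: (op, child, ...)) to depth d,
    # replacing anything deeper with the placeholder '_'.
    if not isinstance(t, tuple):
        return t
    if d <= 1:
        return (t[0],) + ('_',) * (len(t) - 1)
    return (t[0],) + tuple(clip(c, d - 1) for c in t[1:])

def patterns(t, n):
    # Every rooted subtree pattern of depth 1..n, at every node of t.
    out = []
    if isinstance(t, tuple):
        out += [(d, clip(t, d)) for d in range(1, n + 1)]
        for c in t[1:]:
            out += patterns(c, n)
    return out

# "Training": count patterns in trees that solved simpler targets.
# 'sum1' is our shorthand for sum(., 1) and '*' for matrix multiply.
solved = [('sum1', ('*', ('sum1', 'A'), 'B')),
          ('sum1', ('*', ('*', ('*', ('sum1', 'A'), 'B'), 'A'), 'B'))]
counts = Counter(p for tree in solved for p in patterns(tree, 3))

def score(candidate):
    # Weighted combination over depths: a 10**d factor lets the rare,
    # deep (hence more specific) matches dominate the dense 1-gram counts.
    return sum(10 ** d * counts[(d, pat)] for d, pat in patterns(candidate, 3))

assert score(('*', ('sum1', 'A'), 'B')) > score(('t', ('t', 'A')))
```

The scheduler would then pick, among the rules that can legally extend the current tree, the one whose resulting pattern scores highest.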
To build a search strategy from\nthis, we train a softmax layer on top of the RNN to predict which rule should be applied to the current\nexpression (or expressions, since some rules have two inputs), so that we match the target.\nFormally, we have two current branches b1 and b2 (each corresponding to an expression) and wish\nto predict the root operation r that joins them (e.g. .\u2217) from among the valid grammar rules (|r|\nin total). We \ufb01rst use the previously trained RNN to compute \u03c6(b1) and \u03c6(b2). These are then\npresented to a |r|-way softmax layer (whose weight matrix U is of size 2l \u00d7 |r|). If only one branch\nexists, then b2 is set to a \ufb01xed random vector. The training data for U comes from trees that give\nef\ufb01cient solutions to targets of lower degree k (i.e. simpler targets). Training of the softmax layer\nis performed by stochastic gradient descent. We use dropout [13] as the network has a tendency to\nover\ufb01t and repeat exactly the same expressions for the next value of k. Thus, instead of training on\nexactly \u03c6(b1) and \u03c6(b2), we drop activations as we propagate toward the top of the tree (the same\n\n6\n\n\ffraction for each depth), which encourages the RNN to capture more local structures. At test time,\nthe probabilities from the softmax become the scores used by the scheduler.\n7 Experiments\nWe \ufb01rst show results relating to the learned representation for symbolic expressions (Section 4.2).\nThen we demonstrate our framework discovering ef\ufb01cient identities. For brevity, the identities dis-\ncovered are listed in the supplementary material [29].\n7.1 Expression Classi\ufb01cation using Learned Representation\nTable 2 shows the accuracy of the RNN model on expressions of varying degree, ranging from k = 3\nto k = 6. The dif\ufb01culty of the task can be appreciated by looking at the examples in Fig. 1. 
The low\nerror rate of \u2264 5%, despite the use of a simple softmax classi\ufb01er, demonstrates the effectiveness of\nour learned representation.\n\n | Degree k = 3 | Degree k = 4 | Degree k = 5 | Degree k = 6\nTest accuracy | 100% \u00b1 0% | 96.9% \u00b1 1.5% | 94.7% \u00b1 1.0% | 95.3% \u00b1 0.7%\nNumber of classes | 12 | 125 | 970 | 1687\nNumber of expressions | 126 | 1520 | 13038 | 24210\n\nTable 2: Accuracy of predictions using our learned symbolic representation (averaged over 10 different\ninitializations). As the degree increases, the task becomes more challenging because the number of\nclasses grows and the computation trees become deeper. However, our dataset grows larger too (training\nuses 80% of the examples).\n\n7.2 Ef\ufb01cient Identity Discovery\nIn our experiments we consider 5 different families of expressions, chosen to fall within the scope\nof our grammar rules:\n\n1. (\\sum AA^T)_k: A is an Rn\u00d7n matrix. The k-th term is \\sum_{i,j} (AA^T)^{\\lfloor k/2 \\rfloor} for even k\nand \\sum_{i,j} (AA^T)^{\\lfloor k/2 \\rfloor} A for odd k. E.g. for k = 2: \\sum_{i,j} AA^T; for k = 3: \\sum_{i,j} AA^T A;\nfor k = 4: \\sum_{i,j} AA^T AA^T etc. Naive evaluation is O(kn^3).\n2. (\\sum (A.*A)A^T)_k: A is an Rn\u00d7n matrix and let B = A.*A. The k-th term is\n\\sum_{i,j} (BA^T)^{\\lfloor k/2 \\rfloor} for even k and \\sum_{i,j} (BA^T)^{\\lfloor k/2 \\rfloor} B for odd k. E.g. for k = 2: \\sum_{i,j} (A.*A)A^T;\nfor k = 3: \\sum_{i,j} (A.*A)A^T(A.*A); for k = 4: \\sum_{i,j} (A.*A)A^T(A.*A)A^T etc.\nNaive evaluation is O(kn^3).\n3. Sym_k: Elementary symmetric polynomials. A is a vector in Rn\u00d71. For k = 1: \\sum_i Ai.\n\nHowever,\nthe k = 5 solution was found by the RNN consistently faster than the random strategy (100 \u00b1 12 vs\n438 \u00b1 77 secs).\n
p(Success) is non-zero). Best viewed in electronic form.\n\n | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 and higher\n# Terms \u2264 O(n^2) | 39 | 171 | 687 | 2628 | 9785 | Out of memory\n# Terms \u2264 O(n^3) | 41 | 187 | 790 | 3197 | 10k+ | Out of memory\n\nTable 3: The number of possible expressions for different degrees k.\n\n8 Discussion\nWe have introduced a framework based on a grammar of symbolic operations for discovering math-\nematical identities. Through the novel application of learning methods, we have shown how the\nexploration of the search space can be learned from previously successful solutions to simpler ex-\npressions. This allows us to discover complex expressions that random or brute-force strategies\ncannot \ufb01nd (the identities are given in the supplementary material [29]).\nSome of the families considered in this paper are close to expressions often encountered in machine\nlearning. For example, dropout involves an exponential sum over binary masks, which is related to\nthe RBM-1 family. Also, the partition function of an RBM can be approximated by the RBM-2\nfamily. Hence the identities we have discovered could potentially be used to give a closed-form\nversion of dropout, or compute the RBM partition function ef\ufb01ciently (i.e. in polynomial time).\nAdditionally, the automatic nature of our system naturally lends itself to integration with compilers,\nor other optimization tools, where it could replace computations with ef\ufb01cient versions thereof.\nOur framework could potentially be applied to more general settings, to discover novel formulae in\nbroader areas of mathematics. To realize this, additional grammar rules, e.g. involving recursion or\ntrigonometric functions, would be needed. However, this would require a more complex scheduler\nto determine when to terminate a given grammar tree. 
Also, it is surprising that a recursive neural network can generate an effective continuous representation for symbolic expressions. This could have broad applicability in allowing machine learning tools to be applied to symbolic computation. The problem addressed in this paper involves discrete search within a combinatorially large space – a core problem in AI. Our successful use of machine learning to guide the search gives hope that similar techniques might be effective in other AI tasks where combinatorial explosions are encountered.

Acknowledgements
The authors would like to thank Facebook and Microsoft Research for their support.

[Figure 3 panels: p(Success) vs. degree k for the families (∑ AA^T)_k, (∑ (A.∗A)A^T)_k, Sym_k, and (RBM-1)_k, with curves for the RNN 0.3, RNN 0.1, 1- to 5-gram, and Random strategies.]

References
[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[2] L. Bottou. From machine learning to machine reasoning. Machine Learning, 94(2):133–149, 2014.
[3] S. R. Bowman. Can recursive neural tensor networks learn logical reasoning? arXiv preprint arXiv:1312.6192, 2013.
[4] J. P. Bridge, S. B. Holden, and L. C. Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, 53:141–172, August 2014.
[5] C.-L. Chang. Symbolic logic and mechanical theorem proving. Academic Press, 1973.
[6] B. W. Char, K. O. Geddes, G. H. Gonnet, B. L. Leong, M. B. Monagan, and S. M. Watt. Maple V library reference manual, volume 199.
Springer-Verlag New York, 1991.
[7] G. Cheung and S. McCanne. An attribute grammar based framework for machine-dependent computational optimization of media processing algorithms. In ICIP, volume 2, pages 797–801. IEEE, 1999.
[8] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.
[9] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, pages 151–158. ACM, 1971.
[10] M. Desainte-Catherine and K. Barbar. Using attribute grammars to find solutions for musical equational programs. ACM SIGPLAN Notices, 29(9):56–63, 1994.
[11] M. Fitting. First-order logic and automated theorem proving. Springer, 1996.
[12] N. Goodman, V. Mansinghka, D. Roy, K. Bonawitz, and D. Tarlow. Church: a language for generative models. arXiv:1206.3255, 2012.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[14] D. E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127–145, 1968.
[15] M.-T. Luong, R. Socher, and C. D. Manning. Better word representations with recursive neural networks for morphology. In CoNLL, 2013.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
[17] A. Mnih and G. E. Hinton. A scalable hierarchical distributed language model. In NIPS, 2009.
[18] P. Nordin. Evolutionary program induction of binary machine code and its applications. Krehl Munster, 1997.
[19] M. O'Neill, R. Cleary, and N. Nikolov. Solving knapsack problems with attribute grammars. In Proceedings of the Third Grammatical Evolution Workshop (GEWS04). Citeseer, 2004.
[20] A. Pfeffer. Practical probabilistic programming. In Inductive Logic Programming, pages 2–3. Springer, 2011.
[21] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
[22] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, pages 1–9, 2010.
[23] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. P. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
[24] R. P. Stanley. Enumerative Combinatorics. Number 49. Cambridge University Press, 2011.
[25] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL, 2010.
[26] J. Waldispühl, B. Behzadi, and J.-M. Steyaert. An approximate matching algorithm for finding (sub-)optimal sequences in s-attributed grammars. Bioinformatics, 18(suppl 2):S250–S259, 2002.
[27] S. Wolfram. The Mathematica Book, volume 221. Wolfram Media, Champaign, Illinois, 1996.
[28] M. L. Wong and K. S. Leung. Evolutionary program induction directed by logic grammars. Evolutionary Computation, 5(2):143–180, 1997.
[29] W. Zaremba, K. Kurach, and R. Fergus. Learning to discover efficient mathematical identities. arXiv preprint arXiv:1406.1584 (http://arxiv.org/abs/1406.1584), 2014.