{"title": "Tree-to-tree Neural Networks for Program Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 2547, "page_last": 2557, "abstract": "Program translation is an important tool to migrate legacy code in one language into an ecosystem built in a different language. In this work, we are the first to employ deep neural networks toward tackling this problem. We observe that program translation is a modular procedure, in which a sub-tree of the source tree is translated into the corresponding target sub-tree at each step. To capture this intuition, we design a tree-to-tree neural network to translate a source tree into a target one. Meanwhile, we develop an attention mechanism for the tree-to-tree model, so that when the decoder expands one non-terminal in the target tree, the attention mechanism locates the corresponding sub-tree in the source tree to guide the expansion of the decoder. We evaluate the program translation capability of our tree-to-tree model against several state-of-the-art approaches. Compared against other neural translation models, we observe that our approach is consistently better than the baselines with a margin of up to 15 points. Further, our approach can improve the previous state-of-the-art program translation approaches by a margin of 20 points on the translation of real-world projects.", "full_text": "Tree-to-tree Neural Networks for Program\n\nTranslation\n\nXinyun Chen\nUC Berkeley\n\nxinyun.chen@berkeley.edu\n\nChang Liu\nUC Berkeley\n\nliuchang2005acm@gmail.com\n\nDawn Song\nUC Berkeley\n\ndawnsong@cs.berkeley.edu\n\nAbstract\n\nProgram translation is an important tool to migrate legacy code in one language\ninto an ecosystem built in a different language. In this work, we are the \ufb01rst\nto employ deep neural networks toward tackling this problem. 
We observe that\nprogram translation is a modular procedure, in which a sub-tree of the source tree\nis translated into the corresponding target sub-tree at each step. To capture this\nintuition, we design a tree-to-tree neural network to translate a source tree into a\ntarget one. Meanwhile, we develop an attention mechanism for the tree-to-tree\nmodel, so that when the decoder expands one non-terminal in the target tree, the\nattention mechanism locates the corresponding sub-tree in the source tree to guide\nthe expansion of the decoder. We evaluate the program translation capability of our\ntree-to-tree model against several state-of-the-art approaches. Compared against\nother neural translation models, we observe that our approach is consistently better\nthan the baselines with a margin of up to 15 points. Further, our approach can\nimprove the previous state-of-the-art program translation approaches by a margin\nof 20 points on the translation of real-world projects.\n\n1\n\nIntroduction\n\nPrograms are the main tool for building computer applications, the IT industry, and the digital world.\nVarious programming languages have been invented to facilitate programmers to develop programs\nfor different applications. At the same time, the variety of different programming languages also\nintroduces a burden when programmers want to combine programs written in different languages\ntogether. Therefore, there is a tremendous need to enable program translation between different\nprogramming languages.\nNowadays, to translate programs between different programming languages, typically programmers\nwould manually investigate the correspondence between the grammars of the two languages, then\ndevelop a rule-based translator. However, this process can be inef\ufb01cient and error-prone. 
In this work, we make the first attempt to examine whether we can leverage deep neural networks to build a program translator automatically.
Intuitively, the program translation problem is similar in its format to a natural language translation problem. Some previous works propose to adapt phrase-based statistical machine translation (SMT) for code migration [21, 16, 22]. Recently, neural network approaches, such as sequence-to-sequence-based models, have achieved the state-of-the-art performance on machine translation [4, 9, 13, 14, 30]. In this work, we study neural machine translation methods to handle the program translation problem. However, a big challenge that renders a sequence-to-sequence-based model ineffective is that, unlike natural languages, programming languages have rigorous grammars and are not tolerant to typos and grammatical mistakes. It has been demonstrated that it is very hard for an RNN-based sequence generator to generate syntactically correct programs when the lengths grow large [17].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.

In this work, we observe that the main issue of an RNN that makes it hard to produce syntactically correct programs is that it entangles two sub-tasks together: (1) learning the grammar; and (2) aligning the sequence with the grammar. When these two tasks can be handled separately, the performance can typically be boosted. For example, Dong et al. employ a tree-based decoder to separate the two tasks [11]. In particular, the decoder in [11] leverages the tree structural information to (1) generate the nodes at the same depth of the parse tree using an LSTM decoder; and (2) expand a non-terminal and generate its children in the parse tree.
Such an approach has been demonstrated to achieve the state-of-the-art results on several semantic parsing tasks.
Inspired by this observation, we hypothesize that the structural information of both source and target parse trees can be leveraged to enable such a separation. Following this intuition, we propose tree-to-tree neural networks to combine both a tree encoder and a tree decoder. In particular, we observe that in the program translation problem, both source and target programs have their parse trees. In addition, a cross-language compiler typically follows a modular procedure to translate the individual sub-components in the source tree into their corresponding target ones, and then composes them to form the final target tree. Therefore, we design the workflow of a tree-to-tree neural network to align with this procedure: when the decoder expands a non-terminal, it locates the corresponding sub-tree in the source tree using an attention mechanism, and uses the information of the sub-tree to guide the non-terminal expansion. In particular, a tree encoder is helpful in this scenario, since it can aggregate all information of a sub-tree into the embedding of its root, so that the embedding can be used to guide the non-terminal expansion of the target tree.
We follow the above intuition to design the tree-to-tree translation model. Some existing works [28, 18] propose tree-based autoencoder architectures. However, in these models, the decoder can only access a single hidden vector representing the source tree, thus they are not performant on the translation task. In our evaluation, we demonstrate that without an attention mechanism, the translation performance is 0% in most cases, while using an attention mechanism could boost the performance to > 90%.
Another work [6] proposes a tree-based attentional encoder-decoder\narchitecture for natural language translation, but their model performs even worse than the attentional\nsequence-to-sequence baseline model. One main reason is that their attention mechanism calculates\nthe attention weights of each node independently, which does not well capture the hierarchical\nstructure of the parse trees. In our work, we design a parent attention feeding mechanism that\nformulates the dependence of attention maps between different nodes, and show that this attention\nmechanism further improves the performance of our tree-to-tree model considerably, especially\nwhen the size of the parse trees grows large (i.e., 20% \u2212 30% performance gain). To the best of\nour knowledge, this is the \ufb01rst successful demonstration of tree-to-tree neural network architecture\nproposed for translation tasks in the literature.\nTo test our hypothesis, we develop two novel program translation tasks, and employ a Java to C#\nbenchmark used by existing program translation works [22, 21]. First, we compare our approach\nagainst several neural network approaches on our proposed two tasks. Experimental results demon-\nstrate that our tree-to-tree model outperforms other state-of-the-art neural networks on the program\ntranslation tasks, and yields a margin of up to 5% on the token accuracy and up to 15% on the\nprogram accuracy. Further, we compare our approach with previous program translation approaches\non the Java to C# benchmark, and the results show that our tree-to-tree model outperforms previous\nstate-of-the-art by a large margin of 20% on program accuracy. These results demonstrate that our\ntree-to-tree model is promising toward tackling the program translation problem. 
Meanwhile, we believe that our proposed tree-to-tree neural network could also be adapted to other tree-to-tree tasks, and we consider it as future work.

2 Program Translation Problem

In this work, we consider the problem of translating a program in one language into another. One approach is to model the problem as a machine translation problem between two languages, and thus numerous neural machine translation approaches can be applied.

Figure 1: Translating a CoffeeScript program into JavaScript. The sub-component in the CoffeeScript program and its corresponding translation in JavaScript are highlighted.

For the program translation problem, however, a unique property is that each input program unambiguously corresponds to a unique parse tree. Thus, rather than modeling the input program as a sequence of tokens, we can consider the problem as translating a source tree into a target tree. Note that most modern programming languages are accompanied by a well-developed parser, so we can assume that the parse trees of both the source and the target programs can be easily obtained.
The main challenge of the problem under our consideration is that a cross-compiler for translating the programs typically does not exist. Therefore, even if we assume the existence of parsers for both the source and the target languages, the translation problem itself is still non-trivial. We formally define the problem as follows.
Definition 1 (Program translation). Let Ls and Lt be two programming languages, each being a set of instances (pk, Tk), where pk is a program and Tk is its corresponding parse tree. We assume that there exists a translation oracle \u03c0, which maps instances in Ls to instances in Lt.
Given a dataset of instance pairs (is, it) such that is \u2208 Ls, it \u2208 Lt and \u03c0(is) = it, our problem is to learn a function F that maps each is \u2208 Ls into it = \u03c0(is).

In this work, we focus on the problem setting where we have a set of paired source and target programs to learn the translator. Note that all existing program translation works [16, 22, 21] also study the problem under such an assumption. When such an alignment is lacking, the program translation problem is more challenging. Several techniques for NMT have been proposed to handle this issue, such as dual learning [14], which have the potential to be extended for the program translation task. We leave these more challenging problem setups as future work.

3 Tree-to-tree Neural Network

In this section, we present our design of the tree-to-tree neural network. We first motivate the design, and then present the details.

3.1 Program Translation as a Tree-to-tree Translation Problem

Figure 1 presents an example of translation from CoffeeScript to JavaScript. We observe that an interesting property of the program translation problem is that the translation process can be modular. The figure highlights a sub-component in the source tree corresponding to x=1 and its translation in the target tree corresponding to x=1;. This correspondence is independent of other parts of the program. When the program grows longer and this statement occurs repeatedly, it may be hard for a sequence-to-sequence model to capture the correspondence based on only token sequences without structural information.
Thus, such a correspondence makes it a natural solution to locate the referenced sub-tree in the source tree when expanding a non-terminal in the target tree into a sub-tree.

Figure 2: Tree-to-tree workflow: The arrows indicate the computation flow. Blue solid arrows indicate the flow from/to the left child, while orange dashed arrows are for the right child. The black dotted arrow from the source tree root to the target tree root indicates that the LSTM state is copied. The green box denotes the expanding node, and the grey one denotes the node to be expanded in the queue. The sub-tree of the source tree corresponding to the expanding node is highlighted in yellow. The right corner lists the formulas to predict the value of the expanding node.

3.2 Tree-to-tree Neural Network

Inspired by the above motivation, we design the tree-to-tree neural network, which follows an encoder-decoder framework to encode the source tree into an embedding, and decode the embedding into the target tree. To capture the intuition of the modular translation process, the decoder employs an attention mechanism to locate the corresponding source sub-tree when expanding a non-terminal. We illustrate the workflow of a tree-to-tree model in Figure 2, and present each component of the model below.

Converting a tree into a binary one. Note that the source and target trees may contain multiple branches.
Although we can design tree-encoders and tree-decoders to handle trees with an arbitrary number of branches, we observe that encoders and decoders for binary trees can be more effective. Thus, the first step is to convert both the source tree and the target tree into binary trees. To this end, we employ the Left-Child Right-Sibling representation for this conversion.

Binary tree encoder. The encoder employs a Tree-LSTM [29] to compute embeddings for both the entire source tree and each of its sub-trees. In particular, consider a node N with value ts (in its one-hot encoding representation) that has two children NL and NR, its left child and right child respectively. The encoder recursively computes the embedding for N from the bottom up. Assume that the left child and the right child maintain the LSTM states (hL, cL) and (hR, cR) respectively, and the embedding of ts is x. Then the LSTM state (h, c) of N is computed as

(h, c) = LSTM(([hL; hR], [cL; cR]), x)    (1)

where [a; b] denotes the concatenation of a and b. Note that a node may lack one or both of its children. In this case, the encoder sets the LSTM state of the missing child to zero.

Binary tree decoder. The decoder generates the target tree starting from a single root node. The decoder first copies the LSTM state (h, c) of the root of the source tree, and attaches it to the root node of the target tree. Then the decoder maintains a queue of all nodes to be expanded, and recursively expands each of them. In each iteration, the decoder pops one node from the queue and expands it. In the following, we call the node being expanded the expanding node.
First, the decoder predicts the value of the expanding node.
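As an aside on the conversion step above, the Left-Child Right-Sibling transformation can be sketched as follows — a minimal sketch in which the (value, children) tuple representation is an illustrative assumption, not the paper's actual data structure:

```python
def to_lcrs(node, siblings=()):
    """Convert an n-ary tree into its Left-Child Right-Sibling binary form.

    node: a (value, [children]) tuple. Returns (value, left, right), where
    `left` is the node's first child in the original tree and `right` is the
    node's next sibling; a missing child or sibling becomes None.
    """
    value, children = node
    left = to_lcrs(children[0], tuple(children[1:])) if children else None
    right = to_lcrs(siblings[0], tuple(siblings[1:])) if siblings else None
    return (value, left, right)
```

For instance, a node with three children becomes a chain: the first child hangs off as the left child, and the remaining children chain downward as right children (siblings).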
To this end, the decoder computes the embedding et of the expanding node N, and then feeds it into a softmax regression network for prediction:

tt = argmax softmax(W et)    (2)

Here, W is a trainable matrix of size Vt \u00d7 d, where Vt is the vocabulary size of the outputs and d is the embedding dimension. Note that et is computed using the attention mechanism, which we will explain later.
The value of each node tt is a non-terminal, a terminal, or a special \u27e8EOS\u27e9 token. If tt = \u27e8EOS\u27e9, then the decoder finishes expanding this node. Otherwise, the decoder generates one new node as the left child and another new node as the right child of the expanding one. Assume that (h\u2032, c\u2032) and (h\u2033, c\u2033) are the LSTM states of its left child and right child respectively; then they are computed as:

(h\u2032, c\u2032) = LSTML((h, c), Btt)    (3)
(h\u2033, c\u2033) = LSTMR((h, c), Btt)    (4)

Here, B is a trainable word embedding matrix of size d \u00d7 Vt.
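As a concrete illustration, the expansion step of Equations (2)-(4) can be sketched as follows — a minimal numpy sketch with toy dimensions; the simplified LSTM cell and the names (W_out, B_emb, lstm_L, lstm_R) are illustrative assumptions rather than the paper's implementation, and e_t is treated here as a given vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Vt = 8, 10  # toy embedding dimension and target vocabulary size

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(params, h, c, x):
    """A standard LSTM cell: one weight matrix over [h; x] yields the
    input/forget/output gates and the cell candidate."""
    W, b = params
    i, f, o, g = np.split(W @ np.concatenate([h, x]) + b, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c_new), c_new

def make_params(x_dim):
    return rng.normal(size=(4 * d, d + x_dim)) * 0.1, np.zeros(4 * d)

W_out = rng.normal(size=(Vt, d))  # the matrix W of Eq. (2)
B_emb = rng.normal(size=(d, Vt))  # the embedding matrix B of Eqs. (3)-(4)
lstm_L = make_params(d)           # separate parameters for the left child...
lstm_R = make_params(d)           # ...and for the right child

def expand(h, c, e_t):
    """One expansion step: predict the node value (Eq. 2), then compute the
    LSTM states of the two new children (Eqs. 3-4)."""
    t_t = int(np.argmax(W_out @ e_t))  # argmax of softmax(W e_t)
    x = B_emb[:, t_t]                  # embedding B t_t of the predicted value
    return t_t, lstm_cell(lstm_L, h, c, x), lstm_cell(lstm_R, h, c, x)
```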
Note that the generation of the left child and the right child uses two different sets of parameters, for LSTML and LSTMR respectively. These new children are pushed into the queue of all nodes to be expanded. When the queue is empty, the target tree generation process terminates.
Notice that although the sets of terminals and non-terminals are disjoint, it is necessary to include the \u27e8EOS\u27e9 token for the following reasons. First, due to the left-child-right-sibling encoding, although a terminal does not have a child in the original tree, it could have a right child representing its sibling, so \u27e8EOS\u27e9 is still needed for predicting the right branch. Meanwhile, we combine the terminal and non-terminal sets into a single vocabulary Vt for the decoder, and do not incorporate the knowledge of grammar rules into the model; thus the model needs to infer by itself whether a predicted token is a terminal or a non-terminal. In our evaluation, we find that a well-trained model never generates a left child for a terminal, which indicates that the model can learn to distinguish between terminals and non-terminals correctly.

Attention mechanism to locate the source sub-tree. Now we consider how to compute et. One straightforward approach is to compute et as h, which is the hidden state attached to the expanding node. However, in doing so, the embedding will soon forget the information about the source tree when generating deep nodes in the target tree, and thus the model yields very poor performance.
To make better use of the information of the source tree, our tree-to-tree model employs an attention mechanism to locate the source sub-tree corresponding to the sub-tree rooted at the expanding node. Specifically, we compute the following probability:

P(Ns is the source sub-tree corresponding to Nt | Nt)

where Nt is the expanding node.
We denote this probability as P(Ns|Nt), and we compute it as

P(Ns|Nt) \u221d exp(hs^T W0 ht)    (5)

where W0 is a trainable matrix of size d \u00d7 d.
To leverage the information from the source tree, we compute the expectation of the hidden state value across all Ns conditioned on Nt, i.e.,

es = E[hNs | Nt] = \u2211Ns hNs \u00b7 P(Ns|Nt)    (6)

This embedding can then be combined with h, the hidden state of the expanding node, to compute et as follows:

et = tanh(W1 es + W2 h)    (7)

where W1 and W2 are trainable matrices of size d \u00d7 d.
Parent attention feeding. In the above approach, the attention vectors et are computed independently of each other, since once et is used for predicting the node value tt, it is no longer used for further predictions. However, intuitively, the attention decisions for the prediction of each node should be related to each other. For example, for a non-terminal node Nt in the target tree, suppose that it is related to Ns in the source tree; then it is very likely that the attention weights of its children should focus on the descendants of Ns. Therefore, when predicting the attention vector of a node, the model should leverage the attention information of its parent as well.
Following this intuition, we propose a parent attention feeding mechanism, so that the attention vector of the expanding node is taken into account when predicting the attention vectors of its children. Formally, besides the embedding of the node value tt, we modify the inputs to LSTML and LSTMR of the decoder in Equations (3) and (4) as below:

(h\u2032, c\u2032) = LSTML((h, c), [Btt; et])    (8)
(h\u2033, c\u2033) = LSTMR((h, c), [Btt; et])    (9)

Notice that these formulas in their formats coincide with the input-feeding method for sequential neural networks [20], but their meanings are different.
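Concretely, the attention computation of Equations (5)-(7) can be sketched as follows — a minimal numpy sketch with toy dimensions; the function and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy hidden dimension
W0, W1, W2 = (rng.normal(size=(d, d)) for _ in range(3))

def attend(source_hs, h_t):
    """Eqs. (5)-(7): softmax attention over the hidden states of all source
    nodes, expected source embedding e_s, and combined embedding e_t."""
    scores = np.array([h_s @ W0 @ h_t for h_s in source_hs])  # Eq. (5), unnormalized
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # P(Ns | Nt)
    e_s = weights @ np.stack(source_hs)  # Eq. (6): expectation over Ns
    e_t = np.tanh(W1 @ e_s + W2 @ h_t)   # Eq. (7)
    return weights, e_t

# Parent attention feeding (Eqs. 8-9): instead of feeding only the child
# embedding B t_t to LSTM_L / LSTM_R, feed the concatenation [B t_t; e_t],
# so each node's attention decision conditions those of its children.
```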
For sequential models, the input attention vector belongs to the previous token, while here it belongs to the parent node. In our evaluation, we will show that such a parent attention feeding mechanism significantly improves the performance of our tree-to-tree model.

4 Evaluation

In this section, we evaluate our tree-to-tree neural network against several baseline approaches on the program translation task. To do so, we first describe three benchmark datasets in Section 4.1 for evaluating different aspects; then we compare our tree-to-tree model with several baselines, including the state-of-the-art neural network approaches and program translation approaches.

4.1 Datasets

To evaluate different approaches for the program translation problem, we employ three tasks: (1) a synthetic translation task from an imperative language to a functional language; (2) translation between CoffeeScript and JavaScript, which are both full-fledged languages; and (3) translation of real-world projects from Java to C#, which has been used as a benchmark in the literature. Due to the space limit, we present the translation tasks of real-world programming languages (i.e., tasks (2) and (3)) below, and we discuss the synthetic task in the supplementary material.
For the CoffeeScript-JavaScript task, CoffeeScript employs a Python-style succinct syntax, while JavaScript employs a C-style verbose syntax. To control the program lengths of the training and test data, we develop a pCFG-based program generator over a subset of the core CoffeeScript grammar. We also limit the set of variables and literals to restrict the vocabulary size. We utilize the CoffeeScript compiler to generate the corresponding ground truth JavaScript programs. The grammar used to generate the programs in our experiments can be found in the supplementary material.
In doing so, we obtain a set of CoffeeScript-JavaScript pairs, and thus we can build a CoffeeScript-to-JavaScript dataset, and a JavaScript-to-CoffeeScript dataset by exchanging the source and the target. To build the dataset, we randomly generate 100,000 pairs of source and target programs for training, 10,000 pairs as the development set, and 10,000 pairs for testing. We guarantee that there is no overlap among the training, development, and test sets, and all samples are unique in the dataset. More statistics of the dataset can be found in the supplementary material.
For the evaluation on Java to C#, we tried to contact the authors of [22] for their dataset, but received no response. Thus, we employ the same approach as in [22] to crawl several open-source projects which have both a Java and a C# implementation. As in [22], we pair the methods in Java and C# based on their file names and method names. The statistics of the dataset are summarized in the supplementary material. Due to the change of the versions of these projects, the concrete dataset in our evaluation may differ from [22]. For each project, we apply ten-fold validation on matched method pairs, as in [22].

4.2 Metrics

The main metric in our evaluation is the program accuracy, which is the percentage of the predicted target programs that are exactly the same as the ground truth in the dataset. Note that the program accuracy is an underestimation of the true accuracy based on semantic equivalence, and this metric has been used in [22]. This metric is more meaningful than other previously proposed metrics, such as syntax-correctness and dependency-graph-accuracy, which are not directly comparable to semantic equivalence.
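For reference, program accuracy is plain whole-program exact match; a minimal sketch (comparing programs as token lists is an illustrative choice):

```python
def program_accuracy(predictions, references):
    """Fraction of predicted programs that exactly match the ground truth.

    Programs are compared as token sequences here; any comparable
    representation (normalized source string, serialized parse tree)
    works the same way.
    """
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```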
We also measure another metric called token accuracy, and we defer the details\nto the supplementary material.\n\n6\n\n\fT\u2192T\n\nTree2tree\nT\u2192T\n(-PF)\n\nT\u2192T\n(-Attn)\n\nSeq2seq\n\nP\u2192T\n\nP\u2192P\nT\u2192T\nCoffeeScript to JavaScript translation\n\nT\u2192P\n\nSeq2tree\n\nP\u2192T\n\nT\u2192T\n\nTree2seq\n\nT\u2192P\n\nT\u2192T\n\n99.57% 98.80% 0.09% 90.51% 79.82% 92.73% 89.13% 86.52% 88.50% 96.96% 92.18%\nCJ-AS\n99.75% 99.67%\n97.44% 16.26% 98.05% 93.89% 91.97% 88.22% 96.83% 78.77%\nCJ-BS\nCJ-AL 97.15% 71.52%\n21.04%\n80.82% 78.60% 82.55% 46.94%\n95.60% 78.61%\nCJ-BL\n19.26% 9.98% 25.35% 42.08% 76.12% 76.21% 83.61% 26.83%\n\n0%\n0%\n0%\n\n0%\n\n0%\n\n0%\n\n87.75% 85.11% 0.09% 83.07% 86.13% 73.88% 86.31% 86.86% 86.99% 71.61% 86.53%\nJC-AS\n86.37% 80.35%\n80.49% 85.94% 69.77% 85.28% 85.06% 84.25% 66.82% 85.31%\nJC-BS\nJC-AL 78.59% 54.93%\n77.10% 77.30% 65.52% 75.70% 77.11% 77.59% 60.75% 75.75%\n75.62% 44.40%\nJC-BL\n73.14% 73.96% 61.92% 74.51% 74.34% 71.56% 57.09% 73.86%\n\n0%\n0%\n0%\n\nJavaScript to CoffeeScript translation\n\nTable 1: Program accuracy for the translation between CoffeeScript and JavaScript.\n\n4.3 Model Details\n\nWe evaluate our tree-to-tree model against a sequence-to-sequence model [4, 31], a sequence-to-tree\nmodel [11], and a tree-to-sequence model [13]. Note that for a sequence-to-sequence model, there\ncan be four variants to handle different input-output formats. For example, given a program, we can\nsimply tokenize it into a sequence of tokens. We call this format as raw program, denoted as P. We\ncan also use the parser to parse the program into a parse tree, and then serialize the parse tree as a\nsequence of tokens. Our serialization of a tree follows its depth-\ufb01rst traversal order, which is the same\nas [31]. We call this format as parse tree, denoted as T. For both input and output formats, we can\nchoose either P or T. 
For a sequence-to-tree model, we have two variants based on its input format being either P or T; note that the sequence-to-tree model generates a tree as output, and thus requires its output format to be T (unserialized). Similarly, the tree-to-sequence model has two variants, and our tree-to-tree model only has one form. Therefore, we have 9 different models in our evaluation.
The hyper-parameters used in different models can be found in the supplementary material. The baseline models employ their own input-feeding or parent-feeding method that is analogous to our parent attention feeding mechanism.

4.4 Results on the CoffeeScript-JavaScript Task

For the CoffeeScript-JavaScript task, we create several datasets named XY-ZW: X and Y (C or J) indicate the source and target languages respectively; Z (A or B) indicates the vocabulary; and W (S or L) indicates the program length. In particular, vocabulary A uses {x,y} as variable names and {0,1} as literals; vocabulary B uses all alphabetical characters as variable names, and all single digits as literals. S means that the CoffeeScript programs have 10 tokens on average, and L means 20. The program accuracy results are presented in Table 1. We can observe that our tree2tree model outperforms all baseline models on all datasets. In particular, on the datasets with longer programs, it outperforms all seq2seq models in program accuracy by a large margin, i.e., up to 75%. Its margin over a seq2tree model can also reach around 20 points. These results demonstrate that the tree2tree model is more capable of learning the correspondence between the source and the target programs; in particular, it is significantly better than other baselines at handling longer inputs.
Meanwhile, we perform an ablation study to compare the full tree2tree model with (1) tree2tree without parent attention feeding (T\u2192T (-PF)) and (2) tree2tree without attention (T\u2192T (-Attn)).
We observe that the full tree2tree model significantly outperforms the other alternatives. In particular, on JC-BL, the full tree2tree's program accuracy is 30 points higher than the tree2tree model without parent attention feeding.
More importantly, we observe that the program accuracy of the tree2tree model without the attention mechanism is nearly 0%. Note that such a model is similar to a tree-to-tree autoencoder architecture. This result shows that our novel architecture can significantly outperform previous tree-to-tree-like architectures on the program translation task.
However, although our tree2tree model performs better than other baselines, it still could not achieve 100% accuracy. After investigating the predictions, we find that the main reason is that the translation may introduce temporary variables. Because such temporary variables appear very rarely in the training set, it could be hard for a neural network to infer them correctly in these cases. Actually, the longer the programs are, the more temporary variables the cross-compiler may introduce, which makes the prediction harder. We consider further improving the model to handle this problem as future work.

            Tree2tree       J2C#    1pSMT   mppSMT
                            (reported in [22])
Lucene      72.8%           21.5%   21.6%   40.0%
POI         72.2%           18.9%   34.6%   48.2%
Itext       67.5%           25.1%   24.4%   40.6%
JGit        68.7%           10.7%   23.0%   48.5%
JTS         68.2%           11.7%   18.5%   26.3%
Antlr       31.9% (58.3%)   10.0%   11.5%   49.1%

Table 2: Program accuracy on the Java to C# translation. In the parentheses, we present the program accuracy that can be achieved by increasing the training set.

In addition, we observe that for the translation from JavaScript to CoffeeScript, the improvements of the tree2tree model over the baselines are much smaller than for CoffeeScript to JavaScript translation. We attribute this to the fact that the target programs are much shorter.
For example, for a CoffeeScript program with 20 tokens, its corresponding JavaScript program may contain more than 300 tokens. Thus, the model needs to predict far fewer tokens for a CoffeeScript program than for a JavaScript program, so that even seq2seq models can achieve a reasonably good accuracy. Still, we can observe that our tree2tree model outperforms all baselines.

4.5 Results on Real-world Projects

We now compare our approach with three state-of-the-art program translation approaches, i.e., J2C# [15], 1pSMT [21], and mppSMT [22], on the real-world benchmark from Java to C#. Here, J2C# is a rule-based system, 1pSMT directly applies phrase-based SMT on sequential programs, and mppSMT is a multi-phase phrase-based SMT approach that leverages both the raw programs and their parse trees.
The results are summarized in Table 2. For previous approaches, we report the results from [22]. We can observe that our tree2tree approach significantly outperforms the previous state-of-the-art on all projects except Antlr. The improvements range from 20.2% to 41.9%.
On Antlr, the tree2tree model performs worse. We attribute this to the fact that Antlr contains too few data samples for training. We test our hypothesis by constructing another training and validation set from all other 5 projects, and test our model on the entire Antlr. We observe that our tree2tree model can achieve a test accuracy of 58.3%, which is 9 points higher than the state-of-the-art. Therefore, we conclude that our approach can significantly outperform previous program translation approaches when there is sufficient training data.

5 Related Work

Statistical approaches for program translation. Some recent works have applied statistical machine translation techniques to program translation [2, 16, 22, 21, 23, 24].
For example, several works propose to adapt phrase-based statistical machine translation models and leverage grammatical structures of programming languages for code migration [16, 22, 21]. In [23], Nguyen et al. propose to use Word2Vec representations for APIs in libraries used in different programming languages, and then learn a transformation matrix for API mapping. In contrast, our work is the first to employ deep learning techniques for program translation.

Neural networks with tree structures. Recently, various neural networks with tree structures have been proposed to employ the structural information of the data [11, 26, 25, 32, 3, 29, 34, 27, 13, 33, 28, 18, 6]. In these works, different tree-structured encoders are proposed for embedding the input data, and different tree-structured decoders are proposed for predicting the output trees. In particular, [28, 18] propose tree-structured autoencoders to learn vector representations of trees, and show better performance on tree reconstruction and other tasks such as sentiment analysis. Another work [6] proposes to use a tree-structured encoder-decoder architecture for natural language translation, where both the encoder and the decoder are variants of the RNNG model [12]; however, the performance of their model is slightly worse than the sequence-to-sequence model with attention, mainly because their attention mechanism cannot condition the future attention weights on previously computed ones. In this work, we are the first to demonstrate a successful design of a tree-to-tree neural network for translation tasks.

Neural networks for parsing. Other works study using neural networks to generate parse trees from input-output examples [11, 31, 1, 26, 32, 3, 12, 8, 7]. In [11], Dong et al. propose a seq2tree model that allows the decoder RNN to generate the output tree recursively in a top-down fashion.
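This style of top-down tree decoding can be sketched as a recursion over non-terminals. In the real seq2tree model an RNN predicts each node's children; the fixed rule table below is purely a stand-in of our own (all grammar symbols hypothetical), so the sketch shows only the control flow, not the learned component:

```python
# Hypothetical "predictions": for each non-terminal, the children to emit.
# In seq2tree, an RNN conditioned on the input produces these instead.
TOY_RULES = {
    "Expr": ["BinOp"],
    "BinOp": ["Expr1", "+", "Expr2"],
    "Expr1": ["x"],
    "Expr2": ["1"],
}

def decode_top_down(label):
    """Expand a node recursively; symbols with no rule are terminals."""
    children = TOY_RULES.get(label, [])
    return (label, [decode_top_down(c) for c in children])

def serialize(tree):
    """Render the decoded tree as label(child, ...) for inspection."""
    label, children = tree
    if not children:
        return label
    return f"{label}({', '.join(serialize(c) for c in children)})"

print(serialize(decode_top_down("Expr")))
# Expr(BinOp(Expr1(x), +, Expr2(1)))
```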
This approach achieves state-of-the-art results on several semantic parsing tasks. Some other works incorporate knowledge of the grammar into the architecture design [32, 26] to achieve better performance on specific tasks. However, these approaches are hard to generalize to other tasks. Again, none of them is designed for program translation or proposes a tree-to-tree architecture.

Neural networks for code generation. A recent line of research studies using neural networks for code generation [5, 10, 25, 19, 26, 32]. In [19, 26, 32], they study generating code in a DSL from inputs in natural language or in another DSL. However, their designs require additional manual effort to adapt to new DSLs. In our work, we consider the tree-to-tree model as a generic approach that can be applied to any grammar.

6 Conclusion and Future Work

In this work, we are the first to consider neural network approaches for the program translation problem, and the first to demonstrate a successful design of a tree-to-tree neural network combining both a tree-RNN encoder and a tree-RNN decoder for translation tasks. Extensive evaluation demonstrates that our tree-to-tree neural network outperforms several state-of-the-art models. This renders our tree-to-tree model a promising tool toward tackling the program translation problem. In addition, we believe that our proposed tree-to-tree neural network has the potential to generalize to other tree-to-tree tasks, and we consider this as future work.
At the same time, we observe many challenges in program translation that existing techniques are not capable of handling. For example, the models struggle to generalize to programs longer than those in the training set; it is unclear how to handle the infinite vocabulary set that may be employed in real-world applications; further, training requires a dataset of aligned input-output pairs, which may be lacking in practice.
We consider all these problems as important future work on the research agenda toward solving the program translation problem.

Acknowledgement

We thank the anonymous reviewers for their valuable comments. This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915, Berkeley DeepDrive, and DARPA D3M under Grant No. FA8750-17-2-0091. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References
[1] R. Aharoni and Y. Goldberg. Towards string-to-tree neural machine translation. In ACL, 2017.
[2] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton. A survey of machine learning for big code and naturalness. arXiv preprint arXiv:1709.06182, 2017.
[3] D. Alvarez-Melis and T. S. Jaakkola. Tree-structured decoding with doubly-recurrent neural networks. In ICLR, 2017.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[5] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow. DeepCoder: Learning to write programs. In ICLR, 2017.
[6] J. Bradbury and R. Socher. Towards neural machine translation with latent tree attention. arXiv preprint arXiv:1709.01915, 2017.
[7] X. Chen, C. Liu, E. C. Shin, D. Song, and M. Chen. Latent attention for if-then program synthesis. In Advances in Neural Information Processing Systems, pages 4574-4582, 2016.
[8] X. Chen, C. Liu, and D. Song. Towards synthesizing complex programs from input-output examples. In ICLR, 2018.
[9] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In ACL, 2015.
[10] J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed, and P. Kohli. RobustFill: Neural program learning under noisy I/O.
arXiv preprint arXiv:1703.07469, 2017.
[11] L. Dong and M. Lapata. Language to logical form with neural attention. In ACL, 2016.
[12] C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith. Recurrent neural network grammars. In NAACL, 2016.
[13] A. Eriguchi, K. Hashimoto, and Y. Tsuruoka. Tree-to-sequence attentional neural machine translation. In ACL, 2016.
[14] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820-828, 2016.
[15] Java2CSharp. http://sourceforge.net/projects/j2cstranslator/, 2018.
[16] S. Karaivanov, V. Raychev, and M. Vechev. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, pages 173-184. ACM, 2014.
[17] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
[18] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
[19] W. Ling, E. Grefenstette, K. M. Hermann, T. Kočiský, A. Senior, F. Wang, and P. Blunsom. Latent predictor networks for code generation. In ACL, 2016.
[20] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421, 2015.
[21] A. T. Nguyen, T. T. Nguyen, and T. N. Nguyen. Lexical statistical machine translation for language migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 651-654. ACM, 2013.
[22] A. T. Nguyen, T. T. Nguyen, and T. N. Nguyen. Divide-and-conquer approach for multi-phase statistical migration for source code (T).
In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 585-596. IEEE, 2015.
[23] T. D. Nguyen, A. T. Nguyen, and T. N. Nguyen. Mapping API elements for code migration with vector representations. In IEEE/ACM International Conference on Software Engineering Companion (ICSE-C), pages 756-758. IEEE, 2016.
[24] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura. Learning to generate pseudo-code from source code using statistical machine translation (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 574-584. IEEE, 2015.
[25] E. Parisotto, A.-r. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli. Neuro-symbolic program synthesis. In ICLR, 2017.
[26] M. Rabinovich, M. Stern, and D. Klein. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139-1149, 2017.
[27] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129-136, 2011.
[28] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151-161. Association for Computational Linguistics, 2011.
[29] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2015.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010, 2017.
[31] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015.
[32] P. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In ACL, 2017.
[33] X. Zhang, L. Lu, and M. Lapata. Top-down tree long short-term memory networks. In Proceedings of NAACL-HLT, pages 310-320, 2016.
[34] X. Zhu, P. Sobihani, and H. Guo. Long short-term memory over recursive structures. In International Conference on Machine Learning, pages 1604-1612, 2015.