NeurIPS 2020

Unsupervised Translation of Programming Languages


Review 1

Summary and Contributions: This paper applies the unsupervised machine translation method of Lample et al (and Artetxe et al) to the task of translating from one programming language to another.

Strengths: The main contribution of this work is just showing that this can be done, which is good to see. The results are good compared to a rule-based baseline. The other important contribution this work makes is to show that BLEU is not an appropriate evaluation metric, and a more extrinsic measure (% of test cases passed) is needed.
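To make this point concrete, here is a small illustrative example (mine, not from the paper): two Python implementations that pass the same unit tests while sharing almost no tokens, so a reference-match metric such as BLEU would penalize the second one even though a test-based metric correctly accepts both.

    # Hypothetical example: two functionally equivalent translations with very
    # different surface forms.
    def sum_of_squares_ref(xs):            # imagined reference translation
        total = 0
        for x in xs:
            total += x * x
        return total

    def sum_of_squares_hyp(xs):            # imagined model output, low token overlap
        return sum(x ** 2 for x in xs)

    # Extrinsic check: both pass the same unit tests, so both count as correct.
    for case in ([], [1, 2, 3], [-4, 5]):
        assert sum_of_squares_ref(case) == sum_of_squares_hyp(case)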

Weaknesses: Technically the method only makes small changes from the method of Lample et al., e.g., in the handling of tokenization.

Correctness: I would like to take issue with Table 2, which compares "Baselines" against what appear to be several comparable methods, like "TransCoder Beam 1" and "TransCoder Beam 5". The improvements that come from varying the beam size are huge (60.9 -> 70.7 going from "Beam 1" to "Beam 5"), which confused me initially. Two distinctions should be made more clearly. First, beam decoding and top-k decoding are not the same thing, although in the NMT world they are nearly the same thing and are often confused. In decoders that use dynamic programming (like the Viterbi algorithm for HMMs), the hypotheses in the beam at the last time step are not the top k hypotheses. Making this distinction more clearly would shed light on the second distinction, which is that varying the beam size is actually a variation in the evaluation metric, not the method, so that "Beam 1" scores are not directly comparable to "Beam 5" scores. Can I suggest instead calling these "CA@1", "CA@5", etc., to underscore that these are really different evaluation metrics? On the other hand, "Beam 1" vs "Beam 10 - Top 1" is a direct comparison. I would say in this case that the methods being compared are "Beam 1" vs "Beam 10" and the metric is "CA@1". ETA: Thanks for your response; I'm glad you agree with these suggestions.
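For concreteness, here is a minimal sketch (my own, not the authors' code) of how such a CA@k metric could be computed from a k-best list; passes_unit_tests is a hypothetical helper that runs a candidate translation against the test suite of the source function.

    # Sketch of computational accuracy at k (CA@k): a function counts as
    # correctly translated if ANY of its top-k hypotheses passes all unit tests.
    def ca_at_k(hypotheses_per_function, k, passes_unit_tests):
        correct = 0
        for hypotheses in hypotheses_per_function:   # k-best list per test function
            if any(passes_unit_tests(h) for h in hypotheses[:k]):
                correct += 1
        return correct / len(hypotheses_per_function)

    # Under this naming, "Beam 10 - Top 1" means decoding with beam size 10 but
    # scoring only hypotheses[:1], i.e. the same CA@1 metric as "Beam 1".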

Clarity: Yes, except please see "Correctness" above.

Relation to Prior Work: The dependence of this work on Lample et al. is made very clear. The correct citation for Artetxe et al "Unsupervised statistical machine translation" is Proceedings of EMNLP 2018.

Reproducibility: Yes

Additional Feedback: The Broader Impact section focuses only on positives; aren't there potential negative consequences of relying on ML-generated code?


Review 2

Summary and Contributions: The paper proposes an unsupervised method for program-to-program translation (transcompilation) using approaches from the unsupervised machine translation literature. More specifically, the paper investigates translating functions between programming languages such as C++, Java, and Python.

Strengths:
- The paper obtains strong results on the challenging task of program translation using fully unsupervised learning approaches, which is the biggest strength of the paper.
- Although the unsupervised methods of masked language model pre-training, denoising auto-encoding, and back-translation are already well-established and commonly used on mainstream NLP tasks, this paper is the first to show their successful application to program translation (a generic sketch of the back-translation step is included below for concreteness).
- I feel that, given the strong results obtained by the paper, the community will find its methods and results very useful for future work in the program translation field.

This is a very nice piece of work. Based on the above strengths, I recommend its acceptance.
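For concreteness, the back-translation step referred to above looks roughly like the following generic sketch (placeholder models and a hypothetical train_step, not the authors' implementation):

    # Generic back-translation loop over monolingual corpora of two languages.
    # model_xy and model_yx are placeholder encoder-decoders for the two
    # directions; train_step is a hypothetical supervised update on a
    # (source, target) pair.
    def back_translation_epoch(mono_x, mono_y, model_xy, model_yx, train_step):
        for x in mono_x:                        # monolingual functions in language X
            y_hat = model_xy.translate(x)       # noisy synthetic translation into Y
            train_step(model_yx, src=y_hat, tgt=x)   # train Y->X on (y_hat, x)
        for y in mono_y:                        # monolingual functions in language Y
            x_hat = model_yx.translate(y)
            train_step(model_xy, src=x_hat, tgt=y)   # train X->Y on (x_hat, y)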

Weaknesses: However, the paper has some weaknesses as well, which are detailed here:
- There are no detailed results/comparisons with supervised approaches as baselines in the Results section. These comparisons would have helped to understand the limits of the unsupervised methods. One reason might be the lack of high-quality parallel corpora to train such models, but nevertheless these results would be quite valuable in further improving the understanding of the limitations/strengths of unsupervised approaches.
- The paper evaluates the translation of functions with an average length of 110 tokens. It would have been good to have results on longer functions, or on collections of functions in a file.
- The paper doesn't provide ablation studies of the relative importance of the different pre-training steps, such as the masked LM (XLM) pre-training step. Can this step be substituted with the pre-training strategy of models like T5 or BART? In that case, can the first two steps (masked LM and denoising auto-encoding) be combined into a single corruption-and-reconstruction step? (A sketch of what such a combined step might look like is given below.)
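To illustrate the last question, here is a rough sketch (my own, with purely illustrative parameter values) of a single BART/T5-style corruption step that would subsume both masked language modeling and the denoising noise, with the model trained to reconstruct the original token sequence:

    import random

    # Combined corruption: token deletion, token masking, and small local
    # shuffles in one pass, instead of separate masked-LM and denoising steps.
    def corrupt(tokens, p_mask=0.15, p_drop=0.1, shuffle_window=3):
        out = []
        for tok in tokens:
            r = random.random()
            if r < p_drop:
                continue                        # drop the token entirely
            out.append("<MASK>" if r < p_drop + p_mask else tok)
        for i in range(0, len(out), shuffle_window):
            window = out[i:i + shuffle_window]  # local reordering noise
            random.shuffle(window)
            out[i:i + shuffle_window] = window
        return out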

Correctness: Yes, I find the claims, methods, and empirical methodology to be sound and correct.

Clarity: Yes, the paper is very clear, easily understandable and well-written. Kudos to the authors!

Relation to Prior Work: Yes, the paper discusses prior work extensively, both at the task level and at the approach level.

Reproducibility: Yes

Additional Feedback: [UPDATE] In the rebuttal, the authors state that they plan to revise the paper to include more experiments for ablation studies and comparisons with supervised baselines. Overall, I am positive about the paper and the usefulness of its results to the research community.


Review 3

Summary and Contributions: The paper presents a technique for unsupervised translation between programming languages. The paper borrows heavily from Lample et al. [32], where cross-lingual language models were trained for translation between natural languages. It follows the same general flow, using large monolingual corpora of each language to obtain the NMT model. The paper presents TransCoder, a system for translation between programming languages based on monolingual source code. The system is evaluated over language pairs from (C++, Java, Python). TransCoder outperforms commercial rule-based baselines.

Strengths:
- A simple approach, lifting results from NLP (XLM) to programming languages.
- Outperforms commercial rule-based translation systems.
- The results are impressive.
- A thorough analysis of the results and sources of errors.

Weaknesses: This is a nice experiment, but I had hoped to learn more about the unique challenges of translating programming languages and what had to be done to address them. I guess there is only so much that you can fit into a single paper.

Correctness: Experimental evaluation is solid. I would love to see additional data regarding sensitivity to method length and robustness under changes to method names and signatures.

Clarity: Paper is very well written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:
- It would have been great to see more discussion of the challenges of translating programming languages vs. natural languages. What challenges are unique to source code? Just off the top of my head:
  + APIs and libraries
  + variable names
  + compositionality
  + purity / side effects
  + scoping rules
  + types
- The problem of programming language translation is rarely the syntax of the language itself, but rather the semantic differences between the languages and, even worse, between their underlying libraries. You briefly touch on that in line 49, and also in Figure 10 (and in line 279) of the supplementary material. I had hoped for a more elaborate discussion of this point.
- Robustness to variable renaming is a great experiment (Figure 8). Do you have a more systematic robustness experiment that you can report numbers for? (A sketch of the kind of renaming perturbation I have in mind appears after this list.)
- What is the sensitivity of the model to method names? It looks like the name and signature of the method would be critical for the translation to be successful.
- As in other cases, the quality of translation probably degrades as method length increases. Can you share any numbers on that?
- What makes translation easier between certain pairs of languages and harder between other pairs?
- Have you tried translation between more foreign languages, say C++ and OCaml?
- Have you tried translation between more nuanced language pairs, such as Python 2 and Python 3?
- Most of the examples are small algorithmic computations where the implementation is both common and relatively isolated (for example, max, sum_elements, no_letters from Figure 7). I wonder how the approach would work for functions that do not perform a specific isolated task but combine several operations (e.g., load a CSV from a file and save it to a database).
- Passing unit tests is unfortunately not a great metric either, but I agree that it is much better than reference match. Maybe consider a more structured diff that makes it possible to isolate the subtrees of the AST where the translation went wrong.

UPDATE: Thank you for the thoughtful author response. Please include the experiments mentioned in the response in the final version of the paper.
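As an illustration of the renaming experiment suggested above, here is a minimal Python sketch (my own, not from the paper) that systematically rewrites identifiers with the standard tokenize module; a full experiment would restrict renaming to local variables and parameters (e.g., via the ast module) rather than all names.

    import builtins, io, keyword, tokenize

    # Map every non-keyword, non-builtin NAME token to a fresh generic name,
    # then re-assemble the source (spacing may differ after untokenize).
    def rename_identifiers(source, prefix="v"):
        skip = set(dir(builtins))
        mapping, out = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            tok_str = tok.string
            if (tok.type == tokenize.NAME and not keyword.iskeyword(tok_str)
                    and tok_str not in skip):
                mapping.setdefault(tok_str, f"{prefix}{len(mapping)}")
                tok_str = mapping[tok_str]
            out.append((tok.type, tok_str))
        return tokenize.untokenize(out)

    # e.g. rename_identifiers("def add(a, b):\n    return a + b\n")
    # maps add -> v0, a -> v1, b -> v2.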


Review 4

Summary and Contributions: This paper proposes to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. The experimental results show that the proposed model can translate functions between C++, Java, and Python. The authors also build and release a test set of parallel functions.

Strengths:
- The paper shows the proposed approach outperforms rule-based commercial baselines by a significant margin.
- The test set the authors share will be useful for future research.
- The paper proposes a new method to translate functions from one programming language to another, based on monolingual source code.

Weaknesses: This is a fully unsupervised translation model, but I am interested in the case where we have a few supervised examples. How important would they be?

Correctness: In Table 2, the performance gets better as the beam size increases. Did you try increasing the beam size further? How much does the performance change?

Clarity: The paper is mostly clearly written. Did you conduct an ablation study of the effects of denoising auto-encoding and back-translation on this dataset?

Relation to Prior Work: None.

Reproducibility: Yes

Additional Feedback: UPDATE: Thank you for providing the authors' feedback.