Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The setting in which the paper operates is: given a compiler T, compiler options O, and the output machine code M of a function, find source code S that compiles to M using T and O. This is not a particularly compelling case in practice, because it usually breaks under compiler optimizations, but it is still a good start. The main idea of the paper, however, is an elegant process of first generating a "sketch" program, which may not be correct on the first shot, and then iterating until it is correct. Because the compiler is available as an oracle, it is always possible to check whether the correctness requirements are satisfied. This process is a good idea: the translation from machine code to source code must always be right, but a one-shot neural generation procedure is unlikely to capture the entire semantics properly and will inevitably output some errors. I expect the recovery of numerical (or string) constants would be difficult to capture (Q1).
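To make the sketch-and-iterate idea concrete, here is a deliberately toy sketch of the loop, not the paper's actual system: `toy_compile`, `initial_sketch`, and the constant-repair strategy are all hypothetical stand-ins, chosen to show how the compiler serves as a correctness oracle and why constants are a natural target for iterative repair.

```python
def toy_compile(src: str) -> list[str]:
    """Stand-in for compiler T with options O: 'machine code' is a token list."""
    return src.replace("+", " + ").split()

def initial_sketch(target: list[str]) -> str:
    """Stand-in for the neural generator: right shape, wrong constant."""
    return "a + 0"

def oracle_ok(src: str, target: list[str]) -> bool:
    """The compiler is the oracle: recompile the candidate and compare."""
    return toy_compile(src) == target

def decompile(target: list[str], max_iters: int = 100):
    src = initial_sketch(target)
    if oracle_ok(src, target):
        return src
    # Iterative repair: brute-force the constant, checking each candidate
    # against the oracle until the compiled output matches exactly.
    for k in range(max_iters):
        candidate = src.replace("0", str(k))
        if oracle_ok(candidate, target):
            return candidate
    return None

target = toy_compile("a + 42")
print(decompile(target))  # -> a + 42
```

The point of the toy is that correctness is certified by recompilation, never by the neural model itself, which is exactly why the overall procedure can be trusted even when the first sketch is wrong.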
The paper is thought-provoking -- especially in the way it identifies errors and then uses brute-force search to look for solutions that minimize distance in the compilation space. The method is not pretty; the brute-force search through the high-level language space could certainly be improved. The paper is frequently unclear and confusing to read. How do the two LSTMs for the left and right subtree children work? How exactly is the error prediction module trained? The paper makes a distinction between functionally preserving the code and semantically preserving the code, but in this context "semantics" of a programming language has a distinct, established meaning -- another word should be used and this concept clarified. There are too many acronyms, which makes the paper difficult to read; e.g., LD for Levenshtein distance is not a common or well-known acronym. It would be useful for the authors to re-check their answers on the reproducibility checklist.
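For readers who do not recognize the LD abbreviation: Levenshtein distance is the standard edit distance (minimum number of single-character insertions, deletions, and substitutions turning one string into another). A minimal dynamic-programming implementation, included here only to clarify the term, not taken from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance using a rolling row of the cost matrix."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```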
I have questions about the motivation of this work. Good, totally accurate decompilers exist already, and have for a very long time. The authors argue that the reason we need such a neural decompiler is that the development time is so much shorter: it takes a lot of time and effort to develop a classical decompiler. That may be true, but the decompiler described in this paper has the significant disadvantage of not being "correct", in the sense that its output programs are not guaranteed to compile: they may have type errors, use-of-undeclared-variable errors, or semantic errors specific to the PL being decompiled into. This is a significant drawback compared to a classical decompiler, and the authors need to address this issue in the paper by answering the question: what are the applications where it is helpful to decompile into code that possibly has all sorts of semantic errors, and why? Along the same lines, the paper needs to report the fraction of decompiled codes that have compilation errors.

In general, the evaluation could be improved significantly. Really, the main results are in Table 1, which gives the "token accuracy" of the various methods for recovering the original code. First, it is not clear to me what this actually means. Does it mean that if I lex the two programs (original and recovered), this is the fraction of tokens that match? Why is this a meaningful metric? If even one token is wrong, couldn't that fundamentally change the semantics of the recovered program? A good decompiler should have two main requirements: first, the semantics should match the original code; second, the decompiled code should somehow look similar to the original (the decompiled program should be structured the same way as the original, variable names should be meaningful and hopefully match the original, etc.).
"Token accuracy" at least somewhat addresses the second requirement, but semantic correctness is not addressed at all in the paper. The authors do define the notion of "program accuracy" (100% token accuracy), but as far as I can tell, that number is not given in the paper. And, as I mentioned above, metrics such as the percentage of output codes that actually compile are not given. It would be really helpful to see how well things work as the programs get longer. The programs being decompiled are all very short: "long" programs have length 30 (is this 30 lines of assembly?), which does not seem very long. It is also a little unclear to me whether the programs being decompiled are all simple, straight-line code, or whether they include more complicated structures (such as static global variables, classes, function pointers, function calls, function bodies, etc.). Obviously, if the paper only deals with programs without these structures, the methods being developed are a lot less useful. I found it strange that in a paper about decompilation, no example codes are given! Finally, why don't the numbers in Tables 1 and 2 match? I would expect some of the numbers to be the same across the two tables, because the same thing is being tested, but this does not appear to be the case. Why?
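Since the paper does not define "token accuracy", here is one plausible reading, written out only to make the objection concrete; `lex` and `token_accuracy` are my own hypothetical constructions (using Python's lexer as a stand-in for whatever lexer the paper uses), not the paper's definition:

```python
import io
import tokenize

def lex(src: str) -> list[str]:
    """Lex source into token strings, dropping whitespace-only tokens."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    return [t.string for t in toks if t.string.strip()]

def token_accuracy(original: str, recovered: str) -> float:
    """Fraction of positionally matching tokens between the two programs."""
    a, b = lex(original), lex(recovered)
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

# One wrong token out of five gives 0.8 -- yet the two programs below
# compute different values, illustrating why a high token accuracy
# need not imply matching semantics.
print(token_accuracy("x = 1 + 2", "x = 1 - 2"))  # -> 0.8
```

Under this reading, the example makes the review's point directly: a single substituted operator costs only one token yet changes the program's meaning entirely, so token accuracy alone cannot certify a decompiler.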