Review for NeurIPS paper: CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching

NeurIPS 2020

Review 1

Summary and Contributions: This paper considers the problem of learning to search for compiled binaries that match a particular piece of source code. While others have considered this problem before, this proposal considers function-level search, which has not been considered before. The idea is as follows. Three kinds of features are used: the first is the source code itself, the second is the set of strings in the code, and the third is the set of integers in the code. For encoding characters, the authors use DPCNN. For context free grammars, a graph neural network is used. An LSTM is used to encode the sequence of integer literals, and a hierarchical LSTM is used to encode strings. To control the choice of negative samples during learning, the authors adapt a distance-weighted sampling method from the literature and introduce a parameter s that adds some entropy to the distribution of probabilities with which each database object is chosen as a negative sample. Experiments show impressive accuracy: nearly 90% recall@1 is achieved in all four of the experiments that were run.
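
For readers less familiar with the metric quoted above, recall@k is commonly defined as the fraction of queries whose true match appears among the top k retrieved items. The sketch below is a minimal illustration under that common definition; the function name `recall_at_k` and the toy function IDs are hypothetical and are not taken from the paper.

```python
def recall_at_k(ranked_ids, true_ids, k=1):
    """Fraction of queries whose true match is among the top-k retrieved items.

    ranked_ids: one ranked list of candidate IDs per query (best match first).
    true_ids:   the ground-truth ID for each query.
    """
    if not true_ids or len(ranked_ids) != len(true_ids):
        raise ValueError("need one non-empty ranking per query")
    hits = sum(
        1 for ranking, truth in zip(ranked_ids, true_ids) if truth in ranking[:k]
    )
    return hits / len(true_ids)


# Toy data (hypothetical): four source-function queries, each with a ranked
# list of candidate binary functions.
rankings = [
    ["f3", "f1", "f2"],
    ["f2", "f3", "f1"],
    ["f1", "f2", "f3"],
    ["f1", "f3", "f2"],
]
truth = ["f3", "f3", "f1", "f2"]

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(rankings, truth, k):.2f}")
```

A recall@1 near 90% therefore means that, for roughly nine queries out of ten, the correct counterpart is ranked first.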

Strengths: The number one positive aspect of the paper is the very impressive accuracy numbers that are reported. I think that one could point to the simplicity of the approach as a positive aspect.

Weaknesses: I thought that the motivation for function-level matching was a bit weak. I would have liked the paper to open with a scenario or two where this is useful, or absolutely necessary. Another issue is that the paper is not necessarily technically deep, though I hesitate to be too tough on the paper for that reason; being able to get really good results with a simple method may be considered a feature of the approach. Actually, it is surprising to me that the authors treat source code as text, and binary code as a context free grammar. In fact, it seems even more natural to treat source code as a CFG. Binary code is often very simple, whereas source code typically has complicated constructs. I would have liked some more explanation along these lines. I found it problematic that the paper is not self-contained, at least with respect to the loss function and the negative sampling method used. The authors use terms such as the “anchor” with no definition. In equation (1), we have the “margin”, which is not defined. I would say that triplet loss is widely known, but it is not canonical (that is, you can’t just assume that people know what you are talking about, unlike a concept such as an LSTM). Just a few sentences explaining some of these ideas would make the paper much more readable. It was a bit difficult to understand what was being shown in some of the rows in Table 1. In the second section of the table, are you ignoring integers and strings? And in the third section of the same table, are you ONLY using integers and strings? If so, this seems a little strange, as I would expect that most readers want to see the effect of adding different types of features in a cumulative fashion. I think that considering them separately is not too useful. Nowhere in the paper or in the supplementary material is the data set on which this is tested described. Where did the code come from? The information in Section 4.1 is not enough to give even an intuitive feeling for the type of code that is being searched.
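
To illustrate the kind of short explanation requested above, the standard triplet loss can be sketched as follows. The anchor is the query embedding (for example, a source function), the positive is its true counterpart, the negative is a non-matching item, and the margin is the minimum gap by which the negative should be farther from the anchor than the positive. This is the textbook form only; the Euclidean distance, the margin value, and the toy vectors are assumptions, and the paper's equation (1) may differ in detail.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, chosen so the distances are whole numbers.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]       # distance 5 from the anchor
easy_negative = [6.0, 8.0]  # distance 10: already beyond the margin
hard_negative = [0.0, 6.0]  # distance 6: inside the margin

print(triplet_loss(anchor, positive, easy_negative, margin=2.0))
print(triplet_loss(anchor, positive, hard_negative, margin=2.0))
```

The easy negative contributes no loss, while the hard negative does, which is why the choice of negative samples matters during training.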

Correctness: Yes.

Clarity: Yes, the paper is generally quite well-written (modulo the issues with the paper not being totally self-contained).

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: As mentioned above, I think that a detailed description of the data used is important. *** AFTER AUTHOR RESPONSE *** I'd like to thank the authors for their response. I feel that what it comes down to is this: the paper is not technically novel, though the application may be. Is that good enough for a NeurIPS paper? It appears as if it is. Congratulations!!

Review 2

Summary and Contributions: Paper Summary: This paper proposes a cross-modal retrieval method for function-level binary source code matching. It considers the binary source code matching problem as a cross-modal retrieval task. Different semantic features are proposed to represent the features of source and binary code. Experiments on two datasets demonstrate the superior performance of the proposed method.

Strengths: ++It seems interesting to consider function-level binary code matching as a cross-modal retrieval problem. ++The paper is well written and easy to follow.

Weaknesses: 1. The paper aims at proposing a cross-modal retrieval method. However, comparisons with state-of-the-art cross-modal retrieval methods are missing, for example, Deep Supervised Cross-Modal Retrieval (CVPR 2019) and Deep Cross-Modal Hashing (CVPR 2017). 2. The contributions on the cross-modal retrieval part are very weak. The method concatenates the features and adds a batch normalization layer to get the alignment embeddings. Triplet loss is used to achieve cross-modal correlation. The technical part is weak compared with many cross-modal retrieval methods in Computer Vision and Multimedia. I cannot identify the innovation of this part. 3. The paper adopts Deep Pyramid Convolutional Neural Network (DPCNN) for source code feature extraction and Graph Neural Network (GNN) for binary code feature extraction. These methods are simply brought from other areas. The contributions of this part are not enough. After reading the rebuttal, I want to keep my original evaluation.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback:

Review 3

Summary and Contributions: This work: - sets up the problem of matching source code and binary code as a cross-modality retrieval problem; - sets up neural architectures for the two modalities and connects them in a conventional cross-modality retrieval setup; - provides an empirical comparison between the proposed approach and standard baselines.

Strengths: - This work addresses a practical and (in my view) important problem. - Empirical results seem strong. - Nice figures, e.g. Figure 1 to illustrate the problem.

Weaknesses: - The writing could be improved both on the stylistic side and on the clarity of explanations side. - The work connects standard components in a natural way, but this is not necessarily an issue. - If I understood the lines in Table 1 correctly there does not seem to be an ablation evaluating the importance of having the code literal features in the proposed approach?

Correctness: I have not spotted issues with correctness. I have not read the appendix.

Clarity: Clarity, formatting, and language could be improved. For example, the sampling method around equations (7)-(8) could be explained more thoroughly. Minor comments: - Line 137: I think "research" is usually not pluralized. - Equations (2)-(6) could be explained in words, or a figure could be added. - Section 3.4: Could mention that the new hyperparameter s can be seen as a negative temperature parameter, sometimes denoted beta. - Section 3.4: Usage of the x symbol to denote standard multiplication may not be necessary.
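
The temperature-like reading of s mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that each candidate negative has a base log-weight and that the sampling probability is proportional to exp(s * log-weight), so that s plays the role usually given to beta. This assumed form is not taken from the paper's equations (7)-(8), and the log-weights below are hypothetical.

```python
import math


def tempered_probabilities(log_weights, s):
    """Sampling probabilities proportional to exp(s * log_weight).

    s = 0 gives a uniform distribution, s = 1 leaves the base weights
    unchanged, and larger s concentrates mass on the heaviest candidates.
    """
    scaled = [s * lw for lw in log_weights]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical base log-weights for four candidate negatives.
log_w = [0.0, 1.0, 2.0, 3.0]

for s in (0.0, 1.0, 5.0):
    p = tempered_probabilities(log_w, s)
    print(f"s={s}: probs={[round(x, 3) for x in p]}, entropy={entropy(p):.3f}")
```

Under this assumed form, the entropy of the sampling distribution falls as s grows, which is one way to phrase the question of how large s can be made before recall starts to decline.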

Relation to Prior Work: Prior work is discussed rather extensively in Section 2; unfortunately I'm not familiar enough with this domain to be able to say for sure whether something was missed.

Reproducibility: Yes

Additional Feedback: - For the "source to binary" task, would it be even remotely feasible to try compiling the source under many combinations of target architectures and compiler options, and then look for similarities using a more primitive matching algorithm? I assume not, but a comment could perhaps be helpful. - Lines 163-164: Not sure if I understand correctly; in binary code ordering may not matter, but in source code there seems to be potentially useful information in how statements follow each other. Therefore, is the stated "accordance" actually desired? - Line 169: I found the statement that only 10% of the code can be successfully parsed surprising. I would have assumed code must be parse-able if it can be compiled. Can you please explain a bit more? - Table 1: Why does the line "Random" not correspond to a line from the 2nd block? Should the results be identical to the DPCNN+HBMP line, with the difference due to randomness? (If so, it might be helpful to add confidence intervals and/or reduce the precision of reported accuracies, perhaps to 0 decimal digits.) - Would the results get even better with a larger value of s? It might be nice to show when recall starts to decline as s is increased. - Broader impact statement: Could a potential negative outcome occur if the tools to discover known vulnerabilities in a wide range of programs are used by malicious actors?

Review 4

Summary and Contributions: The paper proposes a new model for function-level binary source code matching. The major challenge of this task is how to learn a cross-modal embedding space. The authors propose two modality-specific encoders: a CNN-based source code encoder and a GNN-based binary code encoder. To deal with the string and integer literals in the data, the authors propose a hierarchical LSTM to encode the string features and an integer LSTM to encode the integer features. To better train the cross-modal embedding space, the authors use the triplet loss and sample the negatives based on distance. The final model achieves significantly better performance than the baselines. The contribution is the introduction of the function-level binary source code matching problem and a novel deep-learning-based solution to this problem.

Strengths: The paper is novel and shows that deep learning can help improve the SOTA of function-level binary source code matching. The model design is reasonable. The ablation study in Table 1 demonstrates the effectiveness of each design choice.

Weaknesses: This is an application paper and may lack some novelty in the modeling part. However, in my view, this is a minor limitation.

Correctness: To the best of my knowledge, the evaluation is correct.

Clarity: Yes

Relation to Prior Work: Yes, it's clearly stated in Section 2.

Reproducibility: No

Additional Feedback: The modeling design is reasonable and experiment results are very solid. However, it will be easier for others to reproduce the results if the author could provide more details about the dataset. ———— After Rebuttal ———— I choose to keep my score because the paper proposed an effective solution to a new problem. Although it lacks novelty from the modeling side, it is a good application paper.