DeepBits Deep Thoughts: A Comparative Review of Embedding based Binary Code Search Techniques

1 Introduction

Figure 1: Embedding based binary code search technique

Recently, the researcher Thomas Dullien from Project Zero, published an interesting article [1] to find statically-linked vulnerable library functions in executable code. It employs embedding based binary code search technique, which has drawn increasing interests from both industry and academia [1, 3, 4]. More specifically, as illustrated in Figure 1, given a piece of binary code (e.g., a function), raw feature (CFG, basic block, call graph, etc.) is first extracted. Then machine learning based approach is applied to the raw feature to generate embedding (numerical value). The code similarity between two pieces of code is measured by the distance between two embeddings. Thus, the embeddings can be fed into different models for malware classification, vulnerability search, plagiarism detection, etc. The analysis results should be improved compared with using traditional features like opcode sequence, API call, etc. since the embeddings preserve the high-level semantic information.

Although researchers have demonstrated the promising applications of the embedding based code search technique, in the real-world scenarios, there are still many challenges to overcome before industry deploys this technique. For instance, the same piece of code can be compiled in different compilers, different optimization levels, and even different architectures. It is not that straightforward to apply embedding-centric binary analysis on practical use. In this article, we conducted a comparative study on the latest three embedding-based code similarity detection methods (ASM2Vec, Funsimsearch, Gemini). We would like to measure their training time, evaluation time, and whether they are resilient to different platforms, optimizations, architecture, and obfuscation. In the talk, we will show how we design the experiments, and present the evaluation results. By analyzing those results, we would like to present the insights we learned on how to make the embedding binary analysis practical for industry deployment.

2 Background & Related Work

2.1 Overview of Binary Code Search Techniques

In general, code search technique is composed of the following three steps: feature extraction, embedding, and matching.

Raw Features The debate on how to select features never ends. Some insist that carefully hand-crafted features, with the help of their domain knowledge, can help improve the performance, while others believe machine-learning algorithms are able to learn important features from raw data. However, some experiments show that deep learned features outperform hand-crafted features, and other experiments show the opposite.

An experienced reverse engineer usually utilizes control flow graphs (CFGs), call graphs, combinations of certain instructions, API calls, etc., to get a brief idea of the functionality of a function. Among these features, CFGs and call graphs provide structural information, and the rest give semantic information.

Embedding CFG gives a brief idea of how the function looks like, and if two functions look completely different according to their CFGs, it is not likely that they are similar in terms of functionality. Unfortunately, the graph is not so friendly to machine learning algorithms, the graph may be too large to feed into a machine, and there is not yet an algorithm that can do graph comparison fast enough. To solve this problem, graph embedding is proposed to project a graph into a fixed-size one-dimensional vector while trying to maintain the graph topology, edge relationship, and other relevant information. This method solves all the problems mentioned above, with the cost of losing certain information in the graph.

Matching Algorithm Given two vectors, there are tons of ways to compare the similarity between them. Manhattan distance, Hamming distance, Euclidean distance, cosine similarity and so on. Each measure has its own output range and application, so it is not easy to find a suitable measure of a specific graph embedding method.

2.2 Asm2Vec

Asm2Vec [3] is a representation learning model for assembly code syntax and control flow graph. It first extracts assembly functions, their basic blocks and CFGs using IDA Pro, and then generates multiple sequences for an assembly function by performing a random walk on top of the CFG, while guaranteeing edge coverage. These sequences are now sequentially laid out, thus fit better in the NLP model. Finally, the model generates embeddings and they are compared using cosine similarity.

2.3 FunctionSimSearch

The FunctionSimSearch [1] uses SimHash [2], a locality-sensitive hashing to generate embeddings of the inputs, and a neural network to learn the weights of each feature. This solution is still on its early stage, so for the moment, the inputs are subgraphs of the control-flow graph, n-grams of mnemonics of disassembled instructions, and constants from operands. Hashes are then compared using hamming distance.

2.4 Gemini

Gemini [4] is a deep neural network-based approach to generate embeddings for binary functions for similarity detection. This approach takes attributed control flow graph as input and generates embeddings using a pre-trained model. It combines graph embedding networks into a Siamese network to make the embeddings of two similar functions close to each other, and the pre-trained model can be retrained quickly to adapt to new application scenarios.

3 The Comparative Study

In the real-world scenarios, the same piece of code can be compiled in the different compiler, different optimization level, and even different architecture. Moreover, the code might be inlined in other code, making code search more difficult. We designed several experiments to test if these approaches fit into real-world scenarios. We would like to measure their training time, evaluation time, and whether they are resilient to a different platform, optimization, architecture, and obfuscation.

3.1 Efficiency

We first would like to measure efficiency. All three approaches are academic researches, and we want to find out whether they can be applied in industry.

As we mentioned above, there are usually three steps to compare binaries: feature extraction, graph embedding, and matching. We will measure the time these three approaches take in each step, find out which step is the most time-consuming part, and propose some possible improvements.

We picked 251 Windows 10 executable files whose code size varies from 10KB to 20MB, and there are 1,509,664 functions in total. We select code size instead of file size because binary files may contain resources of any size, and anything but the code itself is irrelevant in code search. The only exception is Asm2Vec where we were only able to put 63,302 functions in its dataset due to its long processing time. To measure the matching performance, we will compare each file in the dataset with the whole dataset. Asm2Vec has a really small dataset, so the matching time of Asm2Vec is for reference only. The experiment is done on a computer with Intel 4700MQ and 16GB memory. The result will show how much time each part takes as the code size increases.

3.2 Accuracy

This experiment is to measure the accuracy of these approaches in different scenarios. To make a fair comparison, we compiled the same source code, OpenSSL 1.1.0, under different compiling flags. In this way, functions with the same name are considered identical. We have compiled OpenSSL 1.1.0 in x64 and x86, both with optimization flag O0 to O3. In each comparison, the datset is split into 70% training set and 30% test set. The test set is then compared to itself, and do a function-level comparison. The exception is Asm2Vec, because its training set is also its codebase. We will compare its codebase with its codebase for Asm2Vec instead.

We count the number of true positives and false positives in top K results. For each function, if there exists at least one function in top K results that have the same function name, then this is a true positive. Likewise, for each function, the number of false positive is the number of functions in top K results that do not have the same function name with the queried function. The number of positive is the number of functions, and the number of negative is the number of function squared minus the number of matching pairs. The output of this experiment is ROC curves and recall rate. ROC curves are good enough to compare the performance, but we added recall rate as well because we were only able to retrieve top 100 results from Asm2Vec due to its excessively slow speed. Plotting partial ROC curves is weird, and the recall rate can give us a more straightforward comparison.

In our first experiment, we used OpenSSL 1.1.0 x64 O0 to O3 as our dataset. This experiment is to test whether these approaches can find functions even when they are compiled in different optimization level. This is one of the common problem reverse engineers are facing: functions compiled in different optimization levels can look totally different, and it is hard to tell whether two functions are identical with the naked eyes. In the second experiment, we used OpenSSL 1.1.0 O3 x86 and x64 as our dataset. This measures the performance when the code is compiled in a similar architecture. We also planned to compare the code in x64 with code in ARM, but FunctionSimSearch does not support ARM architecture, so we skipped that comparison.

Figure 2: Feature Extraction Time

4 Evaluation

4.1 Raw Feature Extraction

Figure 2 shows the linear growth of all three approaches. FunctionSimSearch uses DynInst, a high performance tool with the cost of limited architecture support, to disassemble and extract features from binaries, whereas Gemini and Asm2Vec use IDA Pro. If we look at the time they take to extract 1MB code, FunctionSimSearch is clearly the fastest, about 2 seconds, then Asm2Vec, about 40 seconds followed by Gemini, 100 seconds. FunctionSimSearch outperforms the other two because it does not use auto analysis as IDA Pro does by default, and it lazy evaluates the disassembled results, moving some time to embedding and matching part. On the other hand, both Asm2Vec and Gemini use IDA Pro and extract similar features. The difference in performance indicates that the extraction script of Gemini can be further improved.

Figure 3: Embedding Time

4.2 Embedding

Figure 3 indicates a linear growth, though it is hard to say whether the embedding time of Asm2Vec is a linear growth due to its small sample size. This time Gemini is the fastest as it takes only 2 seconds to embed 1MB code, FunctionSimSearch 30 seconds and Asm2Vec probably 10 minutes. Gemini is fast because it uses highly optimized and vectorized code to generate embedding, bulk input and output. Asm2Vec is slow probably because it has to do random walk on CFGs.

Figure 4: Matching Time

4.3 Matching

Figure 4 presents the matching time. Gemini, again, is fastest due to its highly vectorized code. The result of Asm2Vec is not shown on the figure because it has an extremely small dataset (about 2% of the complete dataset). Asm2Vec takes about 20 seconds to compare the 300KB code with 60,000functions, so it is not comparable to the other two.

Figure 5: Total Time

4.4 Overal comparison

Figure 5 shows that Gemini spends most of the time on feature extraction, whereas FunctionSimSearch on matching. In total, Gemini is about two times faster than FunctionSimSearch. Asm2Vecis excluded in this part because of its small database.

Figure 6: Cross Optimization Evaluation

4.5 Cross Optimization Level

Figure 6 presents ROC curve and recall rate for cross optimization evaluation. The first thing we immediately noticed was that when K = 1, none of these approaches had a good performance. We then looked into the OpenSSL binary code and found almost half of the OpenSSL functions were external functions, which means there is no single instruction in that function. This is probably the reason why the performance is bad at the beginning. The result shows that Gemini outperforms the other two. This is probably because it is better trained, and it uses a model that fits better incode search. Gemini is able to run 10,000 iterations in 1 minute whereas Asm2Vec takes days andFuncionSimSearch takes probably weeks to reach the same iteration number.

Figure 7: 32 and 64 bit Evaluation

4.6 X86 vs. X64

Figure 7 presents ROC curve and recall rate for x86 vs x64 evaluation. In this test, Gemini can better identify functions than in cross optimization scenario. This makes sense because x64 is an extension x86, not a completely different architecture. There are some new opcodes and operands, but the features Gemini extracts is not sensitive to that kind of change, so the performance is pretty good. FunctionSimSearch has mnemonics as its features, so it kind of explains the performance degradation. Likewise, Asm2Vec is sensitive to the changes of opcodes and operands, so its performance is also not good.

5 Discussion

In our experiments, we found if we excluded functions that had less than 5 basic blocks, the accuracy increased a lot. This, we figure, is because about half of OpenSSL binaries are external functions, and they do not have a single instruction, so there is no way to tell differences among these external functions. The second reason would be small functions tend to have similar features, so one possible way to improve this is to extract different features for small functions and large functions.

Gemini is slow on feature extractions, we found some outliers on the figure. This indicates that the extraction script cannot well process some binaries on some circumstances, so it is a good idea to improve the extraction script. We can also infer that DynInst is much faster than IDA Pro. Though we do not know the accuracy of the disassembled results yet, it is worth to try different disassemblers.

In the accuracy tests, the results show that some features are sensitive to even small changes in binary code. This tells us we need to be careful when choosing features. Whether deep learned features are better than manually selected features requires more experiments.

6 Conclusion

The experiments show that Gemini outperforms the other two in both efficiency and accuracy. Gemini is able to extract features, embed and match 1MB code with about 270MB codebase in 100 seconds. FunctionSimSearch, on the other hand, takes about 300 seconds. As for accuracy, Gemini reaches 100% true positive rate while the false positive rate is about 10%. FunctionSimSearch reaches 90% true positive rate when the false positive rate is 20%. In our experiments, Gemini is the best of all three approaches.

References

[1] Searching statically-linked vulnerable library functions in executable code. https://googleprojectzero.blogspot.com/2018/12/searching-statically-linked-vulnerable.html .

[2] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.

[3] S. H. Ding, B. C. Fung, and P. Charland. Asm2vec: Boosting static representation robustness for binary clonesearch against code obfuscation and compiler optimization. In IEEE S&P, 2019.

[4] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 24th ACM Conference on Computer and communications security (CCS’17), Oct. 2017.

DeepBits Deep Thoughts

Thursday, May 9, 2019

A Comparative Review of Embedding based Binary Code Search Techniques