Tuesday, June 23, 2020

A Fast and Accurate Disassembler based on Deep Learning

1. Problem Statement


A disassembler takes a binary program as input and produces disassembly code along with higher-level information such as function boundaries and control flow graphs. Most binary analysis tasks [1, 2, 3, 4] take disassembly code as input to recover syntactic and semantic information about a given binary program. As a result, disassembly is one of the most critical building blocks for binary analysis problems such as vulnerability search [5, 6], malware classification [7], and reverse engineering [8].



Disassembly is surprisingly hard, especially for the x86 architecture, due to variable-length instructions and interleaved code and data. As a result, a simple linear sweep approach such as objdump or Capstone, despite its high efficiency, suffers from low disassembly correctness on Windows binaries and binaries compiled by the Intel C++ Compiler (where jump tables are placed in the code section), and can be easily confused by obfuscators. There is a long history of research on improving disassembly accuracy. For instance, recursive disassembly identifies true instructions by following control-transfer targets. It largely eliminates false instructions but may miss many true instructions that are not reached by other code blocks, leading to a low true positive rate. It also suffers from low runtime efficiency because it needs to traverse the control flow repeatedly. Commercial disassemblers such as IDA Pro and Binary Ninja combine linear sweep and recursive traversal with undocumented heuristics to achieve high disassembly accuracy, at the price of low runtime efficiency. Recently, researchers have explored a probabilistic approach [9] and more expensive static program analysis [10] to achieve high disassembly accuracy.
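
To make the trade-off concrete, the sketch below contrasts the two strategies on a toy byte sequence using the Capstone Python bindings. The byte string, addresses, and helper functions are ours for illustration only; they are not taken from any of the tools discussed here. Linear sweep decodes the single junk byte placed after the jump as the start of a bogus call, swallowing the xor eax, eax that follows, while recursive traversal follows the jump target and recovers it (though in general it misses code that is never referenced).

# A minimal sketch contrasting linear sweep and recursive traversal.
# Uses the Capstone Python bindings; the byte string below is a toy
# example, not a real binary.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64
from capstone.x86 import X86_OP_IMM

CODE_BASE = 0x1000
# jmp over a single junk byte (0xE8), then: xor eax, eax; nop; nop; ret
CODE = b"\xeb\x01\xe8\x31\xc0\x90\x90\xc3"

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True

def linear_sweep(code, base):
    """Decode byte after byte, ignoring control flow (objdump-style)."""
    return [(i.address, i.mnemonic, i.op_str) for i in md.disasm(code, base)]

def recursive_traversal(code, base):
    """Follow direct control-transfer targets starting from the entry point."""
    seen, worklist, out = set(), [base], []
    while worklist:
        addr = worklist.pop()
        while addr not in seen and 0 <= addr - base < len(code):
            insns = list(md.disasm(code[addr - base:], addr, count=1))
            if not insns:
                break
            ins = insns[0]
            seen.add(addr)
            out.append((ins.address, ins.mnemonic, ins.op_str))
            if ins.mnemonic.startswith("j") or ins.mnemonic == "call":
                for op in ins.operands:
                    if op.type == X86_OP_IMM:      # direct target: queue it
                        worklist.append(op.imm)
                if ins.mnemonic == "jmp":
                    break                          # unconditional: no fall-through
            elif ins.mnemonic == "ret":
                break
            addr = ins.address + ins.size
    return sorted(out)

print("linear sweep:       ", linear_sweep(CODE, CODE_BASE))
print("recursive traversal:", recursive_traversal(CODE, CODE_BASE))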

However, even though disassembly strategies have been improved for years, an obfuscator published almost two decades ago [11] can easily trick the commercial disassemblers that are widely used in malware analysis. In our experiments, IDA Pro missed 58% of the code, and Binary Ninja consumed all 16 GB of memory in our test environment and was eventually killed by the OS. Moreover, these sophisticated disassembly approaches all suffer from low runtime efficiency: our experiments show that IDA Pro can only process about 200 KB/s, Binary Ninja about 50 KB/s, and Probabilistic Disassembly about 4 KB/s.

Sadly, despite the importance of disassembly, we still do not have a disassembler that is both accurate and fast enough for the cybersecurity industry. Even though there are many machine-learning-based malware classification approaches, most of them [12, 13, 14] simply apply IDA Pro or other state-of-the-art disassemblers to get the necessary data, and either focus on model efficiency or do not consider efficiency at all. Others [15] may obtain a preliminary disassembly result through linear sweep. However, the linear sweep approach is vulnerable to adversarial attacks and can be exploited by malware. As a result, in cybersecurity practice, disassemblers are mostly used in less time-critical tasks (such as postmortem analysis and incident response). For time-critical tasks (such as malware detection), raw binary code is fed directly into machine learning models or Yara rules, or only linear disassembly is applied, because sophisticated disassemblers cannot handle large volumes of binaries fast enough. Such approaches are extremely vulnerable to adversarial attacks since only superficial features are picked up at the raw-byte level [16].


2. Our Solution (Patent Pending)


In this work, we aim to tackle this blocking issue by leveraging recent advances in deep learning and the increasing parallelism in modern processors. We develop a disassembler called DeepDi that achieves high accuracy and efficiency simultaneously. With a well-trained deep learning model, DeepDi distinguishes legitimate instructions in the raw binary, and then recovers basic-block and function entry points from those instructions using heuristics and a simple classifier.

3. Evaluation


We conducted extensive experiments to evaluate the accuracy, efficiency, and robustness of DeepDi. To evaluate accuracy, we use two widely used datasets, the BAP corpora [17] and SPECint 2006, and compare DeepDi with two commercial disassemblers (IDA Pro and Binary Ninja), an open-source disassembler (Radare2), and a recent academic prototype (Probabilistic Disassembly [9]).


Evaluation Metrics. At the instruction level, we compare DeepDi with all the baselines using the same evaluation metrics as Probabilistic Disassembly [9], namely the False Negative Rate (FNR) and the False Positive Rate (FPR), plotted against binary size. At the basic-block and function levels, since objdump and Probabilistic Disassembly do not output basic-block or function information, we only compare precision and recall with IDA Pro, Binary Ninja, and Radare2, and report performance with precision-recall plots.
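
Concretely, these metrics can be computed from the set of ground-truth instruction (or basic-block/function) start addresses and the set recovered by a tool. The sketch below is our own illustration; the convention shown (false positives counted over all recovered addresses, false negatives over all ground-truth addresses) is one common choice and may differ in detail from the exact formulas in [9].

def rates(ground_truth, recovered):
    """Compute FPR/FNR and precision/recall from two sets of start addresses.
    Convention (an assumption, see above): FPR is the fraction of recovered
    addresses that are wrong; FNR is the fraction of true addresses missed."""
    tp = len(ground_truth & recovered)
    fp = len(recovered - ground_truth)
    fn = len(ground_truth - recovered)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "FPR": fp / len(recovered) if recovered else 0.0,
        "FNR": fn / len(ground_truth) if ground_truth else 0.0,
        "precision": precision,
        "recall": recall,
        "F1": f1,
    }

# Toy example: three true starts; the tool reports one wrong start and misses one.
print(rates({0x10, 0x13, 0x15}, {0x10, 0x13, 0x14}))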

3.1 Accuracy and Efficiency

In this section, we evaluate the accuracy and efficiency of DeepDi and other baseline tools.


Dataset: We conducted experiments on the BAP corpora [17] and the SPECint 2006 programs, which are also used by Probabilistic Disassembly [9]. The BAP corpora contain 2,044 x64 ELF binaries compiled by GCC and the Intel C++ Compiler with optimization levels O0 to O3. The SPECint 2006 programs are compiled by GCC with O2. All binaries are stripped before evaluation.

Training and Testing Details. We randomly shuffled the dataset and divided it into a training set and a testing set using 10-fold cross-validation. Each training phase lasted for only 2 hours. The ground truth is the offsets of valid instructions for each binary file in the dataset.
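
A minimal sketch of such a split, assuming scikit-learn and placeholder file names (the actual training code and model are not shown here):

from sklearn.model_selection import KFold

binaries = [f"bin_{i:04d}.elf" for i in range(2044)]   # placeholder names for the corpus
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(binaries)):
    train_set = [binaries[i] for i in train_idx]
    test_set = [binaries[i] for i in test_idx]
    # ... train the instruction-level model on train_set, evaluate on test_set ...
    print(f"fold {fold}: {len(train_set)} train / {len(test_set)} test binaries")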

3.1.1 Instruction Level Evaluation

We measure the FPR and FNR at the instruction level. Figure 1 shows the distributions of the FPR and FNR for each baseline. We observe that DeepDi is the most stable system, as it has almost no outliers. All disassemblers have a very low false positive rate, but recursive disassembly appears very conservative and can miss almost half of the instructions in some extreme cases.


Figure 1: FPR and FNR at Instruction Level 



Radare2 shows some very poor disassembly results in Figure 1f, where the FNR is close to 80%. This is partly because of our top-down evaluation process: for existing approaches such as Radare2 and Binary Ninja, we first get all the functions identified by these tools, then all basic blocks within each identified function, and finally all instructions within each identified basic block. Radare2 performs poorly on function identification, which leads to poor results in instruction identification.

We did not observe any dangling instructions (instructions that do not belong to any function) in Binary Ninja. Still, Binary Ninja does not perform well on some binaries and treats many functions as data. We could not find an explanation, as there are no clear patterns; we can only speculate that Binary Ninja is very conservative.

IDA Pro, on the other hand, generates many functions named "_unwind", which IDA does not consider regular functions. However, even if we treat them as correct functions, IDA still has a relatively high FNR. We studied an outlier binary, "xalancbmk" from the SPEC benchmark, and found that many functions were treated as data even though there are cross-references to them from the .data.rel.ro section. The underlying reason is still unknown to us.

Objdump has a higher-than-average false positive rate. This result contradicts the study by Andriesse et al. [18] because the definitions of false positives differ: we treat gap-filling nop instructions as data because in theory they are never executed, and for the same reason code placed after non-returning functions is treated as data. If we consider these instructions legitimate, as in [18], the false positive rate is about the same as the false negative rate. Neither is zero because the Intel C++ Compiler also puts jump tables in the code section.

We could not obtain the source code of Probabilistic Disassembly, so we report the results from their paper [9]: the FPR is less than 5% for medium to large binaries (3.7% on average), and the FNR is zero. Table 1 shows the average FPR and FNR for each baseline approach. In comparison, DeepDi performs very well: it achieves a very low FPR of 0.016% and FNR of 0.017%, and the best F1 score.

Table 1:  Instruction Statistical Data



3.1.2 Basic Block Level Evaluation

Figure 2: Basic Block Precision-Recall

The distributions of the basic-block-level precision and recall are shown in Figure 2. The results of DeepDi, IDA Pro, and Binary Ninja cluster at the top-right corner, meaning they have both high precision and high recall. The results of Radare2 are not good: we find that functions without any cross-references are not correctly identified, which explains its low recall. We also find that Radare2 has some false positives due to cross-references from the .rela.dyn section.

Table 2: Basic Block Statistical Data

Table 2 shows the average precision and recall. All tested disassemblers have close to 100% precision on average, despite some outliers. Our technique has the highest recall and the lowest recall deviation, while IDA Pro has the best precision and the lowest precision deviation. Further investigation shows that DeepDi has some outliers because it fails to resolve some jump tables and therefore cannot identify switch-case statements with fall-through.


Figure 3: Function Precision-Recall

Table 3: Function Statistical Data

3.1.3 Function Level Evaluation

Figure 3 shows the distributions of precision and recall at the function level for each binary, and Table 3 shows the corresponding statistics. Our function recovery result is not the best, but it still beats IDA Pro and Radare2 in terms of F1 score. Again, Radare2 misses functions that lack cross-references, and IDA Pro and Binary Ninja treat some functions as data for reasons we do not yet know.
Figure 4: Efficiency Evaluation (DD: DeepDi)


3.2 Efficiency

Figure 4 shows the correlation between binary size and disassembly time for our approach, IDA Pro, Binary Ninja, objdump, and Radare2. The y-axis of this figure is log-scaled. For our approach, the disassembly time also includes basic block recovery and function identification. For Radare2, we use the "aaa" command to choose the lightest yet sufficient analysis. For Binary Ninja, we set the analysis mode to basic. For IDA Pro, we use its default settings and try to avoid unnecessary operations. Although these disassemblers may still do some extra analyses (e.g., resolving indirect jumps), it is safe to say that the efficiency of recursive disassemblers will not exceed that of linear sweep disassemblers, and our GPU implementation is already five times faster than objdump.
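
For reproducibility, wall-clock time for the command-line tools can be measured along the following lines. This is our own harness, not the exact scripts used in the evaluation; the sample binary path is a placeholder, and IDA Pro and Binary Ninja would need their respective headless/scripting interfaces instead.

import os
import subprocess
import time

def time_command(cmd):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start

binary = "/bin/ls"                                   # placeholder test binary
size_mb = os.path.getsize(binary) / 2**20
for name, cmd in [
    ("objdump", ["objdump", "-d", binary]),          # linear sweep
    ("radare2", ["r2", "-qc", "aaa", binary]),       # run its 'aaa' auto-analysis, then quit
]:
    elapsed = time_command(cmd)
    print(f"{name}: {elapsed:.2f}s ({size_mb / elapsed:.2f} MB/s)")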



DeepDi processes about 24.5 MB of binary code per second on GPU and 1.64 MB per second on CPU. The GPU version is 100 times faster than IDA Pro, and the CPU version is still about eight times faster than IDA Pro. For comparison, the throughput of IDA Pro is about 200 KB/s, Binary Ninja about 50 KB/s, and objdump about 4.5 MB/s.


3.3 Obfuscation Evaluation

We conducted an experiment to see whether our approach is resilient to obfuscation and how it compares to the disassemblers with sophisticated heuristics. For this experiment we need obfuscation techniques that generate hard-to-disassemble code; we do not consider obfuscators such as Obfuscator-LLVM [19], because they generate hard-to-read code to provide software security and tamper-proofing but do not challenge disassemblers. The obfuscator [11] we used was proposed 17 years ago, but its techniques are still widely adopted in modern malware. In that paper, the authors proposed inserting junk code after unconditional jumps to confuse linear sweep disassembly, and after conditional jumps that always evaluate to true or false to confuse recursive disassembly. Moreover, unconditional jumps are redirected to a universal function that modifies its return address based on its caller; this nonstandard behavior hides jump targets and breaks heuristics. We trained our model on the dataset introduced in Section 3.1 and directly used the dataset and ground truth provided by Linn and Debray [11] as the testing set, which contains 11 x86 ELF binaries from the SPECint 2000 benchmark suite obfuscated by their tool.
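
The always-taken conditional jump trick is easy to illustrate. The toy bytes below are our own, not output from the obfuscator in [11]: after xor eax, eax the zero flag is always set, so the je is always taken and the bytes on the fall-through path are never executed. A disassembler that treats both branch successors as code decodes those junk bytes, and the bogus call they form swallows the real instructions that follow.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)
BASE = 0x2000
code = bytes([
    0x31, 0xC0,        # xor eax, eax   ; ZF is now always 1
    0x74, 0x03,        # je  +3         ; always taken, real target is BASE+7
    0xE8, 0x31, 0xC0,  # junk bytes: look like the start of a call rel32
    0x90,              # real code resumes here (the je target): nop
    0xC3,              # ret
])

# Decoding the never-executed fall-through path (offset 4) shows how the junk
# "call" consumes the real nop and ret that follow it:
for ins in md.disasm(code[4:], BASE + 4):
    print(f"{ins.address:#x}: {ins.mnemonic} {ins.op_str}")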



Table 4: Obfuscation Test Results


Table 4 shows that although disassemblers have been improved for years, these 17-year-old tricks can still confuse state-of-the-art commercial disassemblers. IDA Pro has a 58% false negative rate, and Binary Ninja consumed all system memory and was killed by the OS before it could completely analyze an obfuscated binary. In contrast, our model has only a 4.5% false positive rate and a 5.2% false negative rate, which shows its resilience to obfuscation. Once we added only two obfuscated binaries (gcc and bzip2) to our training set, the false positive rate dropped to 0.12% and the false negative rate dropped to 0.12%. We do not evaluate the accuracy of basic block and function entry points because this obfuscated dataset [11] only provides ground truth for correct instruction addresses.

Table 5: Gemini’s Efficiency

4. Downstream Application

To further illustrate how machine-learning-based solutions can benefit from our technique, we conducted an experiment with Gemini [6], a popular code similarity detection tool, to show how Gemini can be applied to real-time analysis with the help of DeepDi. Gemini has three components: Feature Extraction, Embedding, and Matching. Feature Extraction extracts manually selected features from binaries; Embedding transforms the features into fixed-length vectors; and Matching searches these vectors in its database.


To evaluate Gemini's effectiveness, we randomly selected 214 Windows 10 x64 system DLLs (264 MB in total) and obtained their debug symbols from the Microsoft public symbol server. We recorded the time spent in Feature Extraction and Embedding, as well as the overall end-to-end embedding generation time. We changed our post-processing step to extract the same features Gemini does, and we used our GPU version for this test.

Figure 5: Gemini Evaluation

Our experiment shows that Feature Extraction takes about 98.9% of the total processing time, because the original Gemini uses a complicated IDAPython script to extract features from binary code. Table 5 shows that we reduce the Feature Extraction time from 15.96 hours to 3.5 minutes, and the end-to-end embedding generation time from 16.14 hours to 14.6 minutes. In other words, we increase Gemini's throughput from 4.7 KB/s to 307 KB/s. Our disassembly efficiency here is not as good as reported in the previous section because we implemented feature extraction on the CPU and did not optimize it much. Even so, our approach adds only minimal overhead to Gemini, and the bottleneck is no longer Feature Extraction. Figure 5 shows how our approach improves Gemini's efficiency; the y-axis is log-scaled.
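
As a sanity check on those throughput numbers, here is our own back-of-the-envelope arithmetic, using the 264 MB corpus size and the times in Table 5:

total_kb = 264 * 1024                    # size of the 214 DLLs in KB
before_s = 16.14 * 3600                  # original end-to-end time, in seconds
after_s = 14.6 * 60                      # end-to-end time with DeepDi, in seconds
print(f"before: {total_kb / before_s:.1f} KB/s")   # ~4.7 KB/s
print(f"after:  {total_kb / after_s:.0f} KB/s")    # ~307-309 KB/s, depending on rounding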


Although code similarity detection is an offline task, we find that it can also be used for malware classification, which essentially turns it into an online task. Given the function embeddings of a binary, we apply Principal Component Analysis (PCA) [20] to generate a new embedding representing the whole binary. The similarity between two binary embeddings tells us whether they belong to the same family. We compare this approach with ssdeep [21], a widely used tool for computing fuzzy hashes of malware, and find that Gemini is more reliable at classifying binaries.
Figure 6: Malware Classification Results

In this experiment, we randomly pick ten WannaCry samples and ten OpenSSL binaries (1.0.2, 1.0.2n, and 1.1.0a to 1.1.0h). We project all function embeddings of a binary to a 128-dimensional vector and use cosine distance to compare the similarity between every pair of binaries. For ssdeep, we retrieve a similarity score (0 to 100) for every pair of binaries and convert that score to the range -1 to 1 for a better comparison with cosine distance. We use a heatmap to plot the similarity values between binaries, as shown in Figure 6. Labels 0 to 9 represent the WannaCry family, and labels 10 to 19 are the OpenSSL binaries.
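
The sketch below shows one way such a comparison could be set up; it is our reading of the description above, not Gemini's or DeepDi's actual code. The function embeddings are random placeholders, the way PCA components are flattened into a single 128-dimensional vector is an assumption, and the ssdeep rescaling simply maps [0, 100] linearly onto [-1, 1].

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

def binary_embedding(function_embeddings, out_dim=128):
    """One plausible reading of 'project all function embeddings of a binary
    to a 128-dimensional vector': keep the top principal directions and
    flatten them into a fixed-length vector."""
    n_funcs, emb_dim = function_embeddings.shape
    k = max(1, out_dim // emb_dim)                   # e.g. 2 components for 64-d embeddings
    pca = PCA(n_components=min(k, n_funcs, emb_dim))
    pca.fit(function_embeddings)
    vec = pca.components_.flatten()[:out_dim]
    return np.pad(vec, (0, out_dim - len(vec)))      # zero-pad if fewer components

def ssdeep_to_cosine_range(score):
    """Map an ssdeep similarity score in [0, 100] onto [-1, 1]."""
    return score / 50.0 - 1.0

# Toy data: 20 "binaries", each with a random number of 64-dimensional
# function embeddings standing in for Gemini's output.
embeddings = np.vstack([
    binary_embedding(rng.normal(size=(int(rng.integers(20, 200)), 64)))
    for _ in range(20)
])
similarity = cosine_similarity(embeddings)           # 20x20 matrix for the heatmap
print(similarity.shape, ssdeep_to_cosine_range(85))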


Figure 6a shows that Gemini can perfectly group these 20 files into two categories: the similarity score between files in the same family is close to 1, and the score between files in different families is close to -1. On the contrary, as illustrated in Figure 6b, only a few pairs of files in the same family are considered similar by ssdeep. This result shows that Gemini is suitable for malware classification and has much better accuracy than ssdeep. Once combined with DeepDi, Gemini is able to classify binaries efficiently.



5. Summary

In this article, we have presented DeepDi, a novel deep-learning-based disassembly technique that achieves both accuracy and efficiency. Our experimental results show that DeepDi has the best F1 scores on regular binaries at all three levels (instruction, basic block, and function); its CPU version is eight times faster than IDA Pro, and its GPU version is 100 times faster. DeepDi generalizes to unseen binaries and counters obfuscation and certain adversarial attacks. When used with Gemini for code similarity search, DeepDi reduces the overall execution time by 94% and improves the throughput from 4.7 KB/s to 307 KB/s. We have also shown that Gemini is capable of classifying malware accurately and efficiently when combined with DeepDi.

6. References

[1] Alessandro Di Federico, Mathias Payer, and Giovanni Agosta. rev.ng: A unified binary analysis framework to recover CFGs and function boundaries. In Proceedings of the 26th International Conference on Compiler Construction, pages 131–141, 2017.
[2] Johannes Kinder and Helmut Veith. Jakstab: A static analysis platform for binaries. In International Conference on Computer Aided Verification, pages 423–427. Springer, 2008.
[3] Manish Prasad and Tzi-cker Chiueh. A binary rewriting defense against stack-based buffer overflow attacks. In USENIX Annual Technical Conference, General Track, pages 211–224, 2003.
[4] Alexander Sepp, Bogdan Mihaila, and Axel Simon. Precise static analysis of binaries by extracting relational information. In 2011 18th Working Conference on Reverse Engineering, pages 357–366. IEEE, 2011.
[5] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 480–491, 2016.
[6] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17), 2017.
[7] Xin Hu, Tzi-cker Chiueh, and Kang G. Shin. Large-scale malware indexing using function-call graphs. In Proceedings of the 2009 ACM SIGSAC Conference on Computer and Communications Security, 2009.
[8] Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, and Wenke Lee. Eureka: A framework for enabling static malware analysis. In European Symposium on Research in Computer Security, pages 481–500. Springer, 2008.
[9] Kenneth Miller, Yonghwi Kwon, Yi Sun, Zhuo Zhang, Xiangyu Zhang, and Zhiqiang Lin. Probabilistic disassembly. In Proceedings of the 41st International Conference on Software Engineering (ICSE '19), pages 1187–1198, Piscataway, NJ, USA, 2019. IEEE Press.
[10] Rui Qiao and R. Sekar. Function interface analysis: A principled approach for function recognition in COTS binaries. In The International Conference on Dependable Systems and Networks, pages 201–212, 2017.
[11] Cullen Linn and Saumya Debray. Obfuscation of executable code to improve resistance to static disassembly. In Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 290–299, 2003.
[12] Bugra Cakir and Erdogan Dogdu. Malware classification using deep learning methods. In Proceedings of the ACMSE 2018 Conference, pages 1–5, 2018.
[13] Rafiqul Islam, Ronghua Tian, Lynn Batten, and Steve Versteeg. Classification of malware based on string and function feature selection. In 2010 Second Cybercrime and Trustworthy Computing Workshop, pages 9–17. IEEE, 2010.
[14] Joris Kinable and Orestis Kostakis. Malware classification based on call graph clustering. Journal in Computer Virology, 7(4):233–245, 2011.
[15] Bojan Kolosnjaji, Ghadir Eraisha, George Webster, Apostolis Zarras, and Claudia Eckert. Empowering convolutional networks for malware classification and analysis. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3838–3845. IEEE, 2017.
[16] Researchers easily trick Cylance's AI-based antivirus into thinking malware is 'goodware'. https://tinyurl.com/y5b53y2u.
[17] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. BAP: A binary analysis platform. In International Conference on Computer Aided Verification, pages 463–469. Springer, 2011.
[18] Dennis Andriesse, Xi Chen, Victor van der Veen, Asia Slowinska, and Herbert Bos. An in-depth analysis of disassembly on full-scale x86/x64 binaries. In Proceedings of the 25th USENIX Security Symposium (SEC '16), pages 583–600, USA, 2016. USENIX Association.
[19] Pascal Junod, Julien Rinaldini, Johan Wehrli, and Julie Michielin. Obfuscator-LLVM – software protection for the masses. In Brecht Wyseur, editor, Proceedings of the IEEE/ACM 1st International Workshop on Software Protection (SPRO '15), Firenze, Italy, May 19, 2015, pages 3–9. IEEE, 2015.
[20] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[21] Jesse Kornblum. Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3:91–97, 2006.



