Thursday, May 9, 2019

A Comparative Review of Embedding-Based Binary Code Search Techniques

1 Introduction

Figure 1: Embedding-based binary code search technique

Recently, Thomas Dullien, a researcher at Project Zero, published an interesting article [1] on finding statically linked vulnerable library functions in executable code. It employs an embedding-based binary code search technique, an approach that has drawn increasing interest from both industry and academia [1, 3, 4]. More specifically, as illustrated in Figure 1, given a piece of binary code (e.g., a function), raw features (control-flow graph (CFG), basic blocks, call graph, etc.) are first extracted. A machine learning model is then applied to the raw features to generate an embedding (a numerical vector). The similarity between two pieces of code is measured by the distance between their embeddings. The embeddings can therefore be fed into different models for malware classification, vulnerability search, plagiarism detection, etc. Because the embeddings preserve high-level semantic information, the analysis results are expected to be better than those obtained with traditional features such as opcode sequences and API calls.
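The similarity step of this pipeline can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any tool discussed here: it assumes cosine similarity as the measure (the text does not name a specific distance), and the function names (`cosine_similarity`, `search`), the index keys, and the three-dimensional embedding values are all hypothetical, hand-written stand-ins for what a trained model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as similar to nothing
    return dot / (norm_a * norm_b)


def search(query, index, top_k=1):
    """Rank indexed functions by similarity to the query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings of known library functions (assumed values,
# not the output of any real model).
index = {
    "memcpy_O2": [0.9, 0.1, 0.0],
    "strlen_O2": [0.1, 0.9, 0.2],
    "md5_update_O0": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of an unknown function found in a target binary.
query = [0.8, 0.2, 0.1]

best = search(query, index, top_k=1)
print(best[0][0])  # name of the closest indexed function
```

In a real deployment the linear scan in `search` would typically be replaced by an approximate nearest-neighbor index, since the number of indexed functions can be very large.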

Although researchers have demonstrated promising applications of embedding-based code search, many challenges remain before industry can deploy the technique in real-world scenarios. For instance, the same piece of code can be compiled with different compilers, at different optimization levels, and even for different architectures, so applying embedding-centric binary analysis in practice is not straightforward. In this article, we conduct a comparative study of three recent embedding-based code similarity detection methods (ASM2Vec, Funsimsearch, Gemini). We measure their training time, their evaluation time, and whether they are resilient to differences in platform, optimization level, architecture, and obfuscation. In the talk, we will show how we designed the experiments and present the evaluation results. By analyzing those results, we aim to share the insights we gained on how to make embedding-based binary analysis practical for industry deployment.