Friday, June 21, 2019

Dr. Binary: searching statically linked vulnerable functions in minutes

1. Introduction

A complex software product often contains packages, libraries, or modules made by third parties, and these third-party components may in turn contain components from other sources. This is known as the software supply chain. Software supply chains are increasingly complicated, and it can be hard to detect statically linked copies of vulnerable third-party libraries in executables.

This blog post discusses how to use Dr. Binary to search for statically linked vulnerable functions in executables. We built httpd with a statically linked copy of OpenSSL 1.0.2a. This OpenSSL release has many known vulnerabilities (e.g., CVE-2015-1788). Because the library is statically linked, these vulnerabilities cannot be detected by version-based detection approaches alone. The following paragraphs illustrate how to use Dr. Binary to identify such a statically linked vulnerable function.

Dr. Binary: Searching Vulnerabilities in Binaries

A vulnerability scanner is at the heart of a typical vulnerability management solution. It uses a list of known vulnerabilities to spot potential problems in a system. Traditionally, a vulnerability scanner either conducts a dynamic penetration test or statically checks the version of the examined software for a match in a vulnerability database. The more information the scanner has, the more accurate its results.

Instead of conducting a penetration test or checking the version of binaries to find known vulnerabilities, Dr. Binary takes a different approach. A software vulnerability can be represented as one or several code fragments. Dr. Binary first extracts the vulnerable code fragments and generates "embeddings" that serve as the vulnerability signature. Then, given an input program, Dr. Binary decomposes it into code fragments, generates their embeddings, and compares them against the ones in the vulnerability database to determine whether the vulnerability is present.
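The matching step described above can be sketched as follows. This is only an illustration and not Dr. Binary's actual implementation: the embeddings are made-up three-dimensional vectors, the fragment names (`sub_401000`, `sub_402000`), the `scan` helper, the use of cosine similarity, and the 0.95 threshold are all assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def scan(fragment_embeddings, signature_db, threshold=0.95):
    """Compare every fragment embedding with every vulnerability signature.

    Returns a list of (fragment name, vulnerability id, score) tuples for
    all pairs whose similarity reaches the (assumed) threshold.
    """
    hits = []
    for name, emb in fragment_embeddings.items():
        for vuln_id, sig in signature_db.items():
            score = cosine_similarity(emb, sig)
            if score >= threshold:
                hits.append((name, vuln_id, score))
    return hits


# Hypothetical data: one signature and two fragments from an input program.
signature_db = {"CVE-2015-1788": [0.9, 0.1, 0.4]}
fragments = {
    "sub_401000": [0.88, 0.12, 0.41],  # close to the signature
    "sub_402000": [0.0, 1.0, 0.0],     # unrelated code
}

hits = scan(fragments, signature_db)
for name, vuln_id, score in hits:
    print(f"{name} matches {vuln_id} (similarity {score:.3f})")
```

In a real deployment the embeddings would come from a trained model and the database would likely be searched with an approximate nearest-neighbor index rather than the exhaustive loop used here.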

Thursday, May 9, 2019

A Comparative Review of Embedding based Binary Code Search Techniques

1. Introduction

Figure 1: Embedding based binary code search technique

Recently, Thomas Dullien, a researcher at Project Zero, published an interesting article [1] on finding statically linked vulnerable library functions in executable code. It employs an embedding-based binary code search technique, which has drawn increasing interest from both industry and academia [1, 3, 4]. More specifically, as illustrated in Figure 1, given a piece of binary code (e.g., a function), raw features (CFG, basic blocks, call graph, etc.) are first extracted. A machine-learning-based approach is then applied to the raw features to generate an embedding (a numerical vector). The similarity between two pieces of code is measured by the distance between their embeddings. The embeddings can therefore be fed into different models for malware classification, vulnerability search, plagiarism detection, etc. The analysis results are expected to improve on those obtained with traditional features such as opcode sequences and API calls, since the embeddings preserve high-level semantic information.
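To make the pipeline concrete, here is a minimal sketch of the "raw feature, then vector, then distance" idea. It is a deliberate simplification: real systems learn the embedding with a machine learning model, whereas this example uses a few hand-picked CFG statistics as a stand-in. The CFG representation (a dict from basic block id to successor ids) and the helper names are assumptions made for illustration.

```python
import math


def cfg_features(cfg):
    """Turn a CFG into a small numeric vector.

    cfg maps each basic block id to the list of its successor block ids.
    The four features are a hand-picked stand-in for a learned embedding.
    """
    num_blocks = len(cfg)
    num_edges = sum(len(succs) for succs in cfg.values())
    num_branches = sum(1 for succs in cfg.values() if len(succs) > 1)
    num_exits = sum(1 for succs in cfg.values() if not succs)
    return [float(num_blocks), float(num_edges),
            float(num_branches), float(num_exits)]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two diamond-shaped CFGs with different block ids, and one straight-line CFG.
cfg_a = {0: [1, 2], 1: [3], 2: [3], 3: []}
cfg_b = {10: [11, 12], 11: [13], 12: [13], 13: []}
cfg_c = {0: [1], 1: []}

dist_ab = euclidean_distance(cfg_features(cfg_a), cfg_features(cfg_b))
dist_ac = euclidean_distance(cfg_features(cfg_a), cfg_features(cfg_c))
print(dist_ab, dist_ac)
```

Functions with the same control-flow shape end up at distance zero, while structurally different functions are farther apart. A learned embedding aims for the same property while also capturing semantics that simple counts miss.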

Although researchers have demonstrated promising applications of the embedding-based code search technique, there are still many challenges to overcome before industry can deploy it in real-world scenarios. For instance, the same piece of code can be compiled with different compilers, at different optimization levels, and even for different architectures. It is therefore not straightforward to apply embedding-centric binary analysis in practice. In this article, we conducted a comparative study of three recent embedding-based code similarity detection methods (ASM2Vec, Funsimsearch, Gemini). We measure their training time and evaluation time, and whether they are resilient to different platforms, optimizations, architectures, and obfuscation. In the talk, we will show how we designed the experiments and present the evaluation results. By analyzing those results, we would like to present the insights we gained on how to make embedding-based binary analysis practical for industry deployment.
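One way such resilience could be quantified is top-1 retrieval accuracy: embed each function from one build (say, an unoptimized one), search for its nearest neighbor among the embeddings from another build (say, an optimized one), and count how often the nearest neighbor is the same function. The sketch below assumes this metric for illustration only; it is not necessarily the one used in the study, and the function names and vectors are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top1_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate has the same name.

    queries and candidates map function names to embedding vectors
    produced from two different builds of the same source code.
    """
    if not queries:
        return 0.0
    correct = 0
    for name, query_emb in queries.items():
        nearest = min(
            candidates,
            key=lambda cand: squared_distance(query_emb, candidates[cand]),
        )
        if nearest == name:
            correct += 1
    return correct / len(queries)


# Hypothetical embeddings of the same two functions from two builds.
build_unoptimized = {
    "md5_update": [1.0, 0.0, 0.2],
    "aes_encrypt": [0.1, 1.0, 0.0],
}
build_optimized = {
    "md5_update": [0.9, 0.1, 0.3],
    "aes_encrypt": [0.0, 0.8, 0.1],
}

accuracy = top1_accuracy(build_unoptimized, build_optimized)
print(f"top-1 accuracy across builds: {accuracy:.2f}")
```

The same measurement can be repeated for each pair of compilers, optimization levels, or architectures to see where a given embedding method degrades.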