A Scalable Search Index for Binary Files

October 28, 2014 • Article

By

Wesley Jin, Chuck Hines, Cory Cohen, and Priya Narasimhan (Carnegie Mellon University)

In this article, the authors present a scalable architecture for searching and indexing terabyte-size collections of binary files.

Publisher

ACM, Inc.

Subjects

Reverse Engineering For Malware Analysis

Abstract

The ability to locate specific byte sequences in large collections of binary files is important in many applications, especially malware analysis. However, it can be a time-consuming process. Researchers and analysts, such as those at CERT, often have to search terabytes of data for characteristic patterns and signatures, and a single search can take days or more to complete. Many search systems exist that are designed specifically to expedite text and metadata queries, but these tools are unsuitable for searching files containing arbitrary bytes. We present a scalable architecture for searching and indexing terabyte-size collections of binary files that uses probabilistic techniques to pre-filter likely search candidates. Our implementation performs searches in minutes that would require days to complete using iterative techniques. It also reduces storage costs by balancing the amount of data indexed with the total time required to conduct and verify a query.
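
The pre-filter-then-verify idea in the abstract can be illustrated with a small sketch. The abstract does not name the probabilistic structure it uses, so everything below is an assumption made for illustration: a per-file Bloom filter over fixed-length byte n-grams, with hypothetical names (NgramBloom, build_index, search) and arbitrary parameters. The filter can report false positives but never false negatives, so every candidate it returns is confirmed with an exact byte search.

```python
import hashlib


class NgramBloom:
    """Illustrative per-file Bloom filter over byte n-grams (not the authors' code)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3, n=4):
        self.num_bits = num_bits      # filter size: the storage vs. accuracy knob
        self.num_hashes = num_hashes  # hash functions applied per n-gram
        self.n = n                    # n-gram length in bytes
        self.bits = bytearray((num_bits + 7) // 8)

    def _ngrams(self, data):
        # Every contiguous n-byte window of the input.
        return (data[i:i + self.n] for i in range(len(data) - self.n + 1))

    def _positions(self, gram):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(gram, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_file(self, data):
        for gram in self._ngrams(data):
            for pos in self._positions(gram):
                self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, pattern):
        # Patterns shorter than n cannot be filtered, so they always need verification.
        if len(pattern) < self.n:
            return True
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for gram in self._ngrams(pattern)
            for pos in self._positions(gram)
        )


def build_index(corpus):
    """Build one filter per file; corpus maps file name to file bytes."""
    index = {}
    for name, data in corpus.items():
        bloom = NgramBloom()
        bloom.add_file(data)
        index[name] = bloom
    return index


def search(corpus, index, pattern):
    # Step 1: cheap probabilistic pre-filter selects candidate files.
    candidates = [name for name, bloom in index.items() if bloom.might_contain(pattern)]
    # Step 2: exact verification removes any false positives.
    return [name for name in candidates if pattern in corpus[name]]


corpus = {
    "a.bin": b"\x00\x01MZ\x90\x00\xde\xad\xbe\xef\x10",
    "b.bin": bytes(range(32, 64)),
}
index = build_index(corpus)
matches = search(corpus, index, b"\xde\xad\xbe\xef")
```

In a sketch like this, a smaller filter costs less storage but admits more false positives, which lengthens the verification step. That is one way to picture the trade-off the abstract describes between the amount of data indexed and the total time needed to conduct and verify a query.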

Software Engineering Institute