Code Similarity Detection Using Syntax-Agnostic Locality Sensitive Hashing

August 2020 Presentation

This presentation describes how to maintain the security of large codebases by using Syntax-Agnostic Locality Sensitive Hashing (LSH) to detect and search for code similarity.

Maintaining software security as the volume of new code increases is a pressing "big data" problem. Once a vulnerability is identified in one piece of software, it is critical to identify other software that might contain a similar vulnerability. However, conducting this type of search is time-consuming and challenging. In this presentation, we discuss Syntax-Agnostic Locality Sensitive Hashing (Syntax-Agnostic LSH), an efficient method for finding code with similar functionality in large code repositories. Our approach significantly reduces the amount of time analysts need to identify potentially vulnerable software.

LSH is known to find near-duplicate documents successfully at scale. It has also proven effective in applications such as audio/video/image searching, entity resolution, and fingerprint comparison. Applying LSH to software makes searching fast: it compresses code segments into hashes and eliminates the need for pairwise comparisons by clustering similarly hashed code segments together. Because we hash on the semantic meaning of code segments rather than the code itself, our variant of LSH handles the varying coding styles and compilation strategies that can cause code with the same functionality to look syntactically different.
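To make the bucketing idea above concrete, here is a minimal sketch of one common LSH scheme, MinHash with banding. It is not the presenters' implementation: the abstract does not say which hash family or which semantic features they use, so the feature strings, the function names (`minhash_signature`, `build_index`, `candidates`), and the parameters `NUM_HASHES` and `BANDS` are all assumptions chosen for illustration. Each code segment is assumed to be already reduced to a non-empty set of semantic feature strings.

```python
import hashlib
from collections import defaultdict

# Hypothetical parameters: 32 hash functions split into 8 bands of 4 rows.
NUM_HASHES = 32
BANDS = 8
ROWS = NUM_HASHES // BANDS


def _hash(seed: int, feature: str) -> int:
    """Stable 64-bit hash of a feature under a given seed."""
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features):
    """Compress a non-empty feature set into a fixed-length signature.

    For each seeded hash function, keep the minimum hash over the set.
    Sets with high Jaccard similarity agree in many signature positions.
    """
    return tuple(min(_hash(seed, f) for f in features) for seed in range(NUM_HASHES))


def _band_keys(signature):
    """Yield one bucket key per band of the signature."""
    for band in range(BANDS):
        yield band, signature[band * ROWS:(band + 1) * ROWS]


def build_index(segments):
    """Map each band key to the names of the segments that fall into it."""
    buckets = defaultdict(set)
    for name, features in segments.items():
        for key in _band_keys(minhash_signature(features)):
            buckets[key].add(name)
    return buckets


def candidates(buckets, features):
    """Return segments sharing at least one band with the query.

    Only these candidates need a closer look; no pairwise scan of the
    whole repository is required.
    """
    found = set()
    for key in _band_keys(minhash_signature(features)):
        found |= buckets.get(key, set())
    return found


# Hypothetical feature sets standing in for semantic summaries of code.
segments = {
    "copy_v1": {"load", "compare", "branch", "store"},
    "copy_v2": {"store", "branch", "compare", "load"},
    "unrelated": {"alloc", "free"},
}
index = build_index(segments)
matches = candidates(index, segments["copy_v1"])
```

In this sketch, `copy_v1` and `copy_v2` have identical feature sets, so they share every bucket and both appear in `matches`. Segments that are merely similar collide with a probability that rises with their Jaccard similarity; the number of bands and rows tunes that trade-off between missed matches and false candidates.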

Using Syntax-Agnostic LSH as a code similarity detection and searching capability reduces the time, effort, and cost of debugging and maintaining software and helps us stay one step ahead of attackers. Our approach is both an investigative and a preventative tool. It allows for much faster identification of code with technical or logical vulnerabilities that need to be fixed, and it encourages the reuse of "repaired" code through its ability to search for code segments by functionality rather than syntax. Our cyber team has incorporated Syntax-Agnostic LSH into its investigative platform, with the expectation that it will decrease the length of investigations from 3-4 weeks to under a week.