
Large Scale Data Preparation for Machine Learning Models

February 2023 Presentation
Matthew Spitzer (Mayo Clinic)

This talk describes one methodology that has been applied to the preparation of large-scale data in support of ML modeling activities.


Software Engineering Institute



This presentation was given at FloCon 2023, an annual conference that focuses on applying any and all collected data to defend enterprise networks.

Cybersecurity threat actors continue to use modern technologies and advanced techniques to escape detection as they infiltrate corporate and government networks. One indicator of possible threat activity is the long-running connection (LRC). Locating LRCs in a data lake involves selecting parameters and extracting features, including aggregations, to prepare for proper analysis. Machine learning (ML) models can potentially identify threats in ways that go beyond looking for LRC occurrences. However, thorough data preparation is a prerequisite for confidence in the accuracy and correct interpretation of the results, and that data preparation process is the focus of this presentation.

Preparing the data for analysis involves evaluating both the completeness and the accuracy of the data, determining a process for fusing multiple disparate sources, and performing scalable statistical analysis to develop a deep understanding of the data in the lake. This analysis involves large data volumes, so feature extraction aggregates network data fields to make processing more efficient, which in turn serves as preparation for modeling exercises.
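As a rough illustration of the completeness evaluation described above, the following minimal sketch reports the fraction of records that carry a usable value for each field. The record layout and the field names (`src_ip`, `dst_port`) are hypothetical and are not taken from the presentation; a real data lake would run an equivalent check in a distributed query engine rather than in-memory Python.

```python
from collections import Counter


def field_completeness(records, fields):
    """Return, for each field, the fraction of records with a non-empty value.

    `records` is a list of dicts; a value of None or "" counts as missing.
    """
    total = len(records)
    filled = Counter()
    for rec in records:
        for field in fields:
            if rec.get(field) not in (None, ""):
                filled[field] += 1
    return {f: (filled[f] / total if total else 0.0) for f in fields}


if __name__ == "__main__":
    # Hypothetical sample records with some missing values.
    sample = [
        {"src_ip": "10.0.0.1", "dst_port": 443},
        {"src_ip": "", "dst_port": 22},
        {"src_ip": "10.0.0.2", "dst_port": None},
        {"src_ip": "10.0.0.3"},
    ]
    print(field_completeness(sample, ["src_ip", "dst_port"]))
```

Fields with low completeness are candidates for exclusion or for repair from another source before any aggregation is attempted.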

LRC parameter selection must also go through a level of feature extraction because there are many separate ways to identify behavior baselines. For example, searching by device, by location, or by time frame are all valid approaches. In our case, we defined long-running connections as device connections lasting longer than 20 days. These connections provided the basis for our statistical analysis and helped inform the validation and verification steps necessary to prepare for model evaluation.
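The 20-day criterion can be expressed as a simple duration filter. The sketch below is illustrative only: the 20-day threshold comes from the abstract, while the record layout (`device`, `start`, `end`) and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Threshold stated in the abstract: connections lasting longer than 20 days.
LRC_THRESHOLD = timedelta(days=20)


def find_long_running(connections, threshold=LRC_THRESHOLD):
    """Return the connections whose duration is strictly longer than threshold.

    Each connection is a dict with datetime values under "start" and "end"
    (a hypothetical layout chosen for this sketch).
    """
    return [c for c in connections if c["end"] - c["start"] > threshold]


if __name__ == "__main__":
    conns = [
        {"device": "host-a", "start": datetime(2023, 1, 1), "end": datetime(2023, 1, 25)},
        {"device": "host-b", "start": datetime(2023, 1, 1), "end": datetime(2023, 1, 10)},
    ]
    for conn in find_long_running(conns):
        print(conn["device"], conn["end"] - conn["start"])
```

Keeping the threshold as a parameter makes it straightforward to repeat the analysis by device, location, or time frame with a different cutoff.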

Initial feature extraction and verification involved aggregating fragmented network flow records, which significantly reduced the volume of data needing to be scanned, and then performing statistical analysis. The results helped tune the thresholds for potential threats so that they look beyond connection duration alone and also consider port, protocol, and connections to destinations outside of baselines.
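One possible way to aggregate fragmented flow records is to group them by connection key and merge fragments that follow one another closely in time. The sketch below is an assumption about how such a step could look, not the presenters' implementation; the field names (`src`, `dst`, `dport`, `proto`, `start`, `end`, `bytes`) and the `max_gap` tolerance of 60 seconds are hypothetical.

```python
from collections import defaultdict


def aggregate_flows(records, max_gap=60.0):
    """Merge flow fragments that share a key and are close together in time.

    Fragments with the same (src, dst, dport, proto) are merged when the next
    fragment starts no more than `max_gap` seconds after the current one ends.
    Timestamps are plain numbers (seconds) in this sketch.
    """
    grouped = defaultdict(list)
    for rec in records:
        key = (rec["src"], rec["dst"], rec["dport"], rec["proto"])
        grouped[key].append(rec)

    merged = []
    for key, recs in grouped.items():
        current = None
        for rec in sorted(recs, key=lambda r: r["start"]):
            if current is not None and rec["start"] - current["end"] <= max_gap:
                # Extend the running connection with this fragment.
                current["end"] = max(current["end"], rec["end"])
                current["bytes"] += rec["bytes"]
                current["fragments"] += 1
            else:
                if current is not None:
                    merged.append(current)
                current = {
                    "key": key,
                    "start": rec["start"],
                    "end": rec["end"],
                    "bytes": rec["bytes"],
                    "fragments": 1,
                }
        if current is not None:
            merged.append(current)
    return merged
```

Each merged row replaces several fragment rows, which is where the reduction in scanned volume comes from; the `fragments` count is kept so the aggregation can be verified against the raw data.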

The outcome of the statistical analysis also contributes to identifying patterns within the data based on distributions. These distributions form the basis for our model evaluation and support a higher degree of confidence in the results. The initial statistics help verify that the data is clean and representative. Using values derived from both the raw data and the aggregated data yields consistent, reproducible outcomes that accurately identify deviations from expected behavior within our network.
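To make the idea of distribution-based deviation concrete, the sketch below flags observations that sit far from a baseline distribution using a z-score. The abstract does not say which statistic was used, so the z-score, the threshold of 3.0, and the function name are all assumptions chosen for illustration.

```python
import statistics


def flag_deviations(baseline, observations, threshold=3.0):
    """Return observations more than `threshold` standard deviations from
    the baseline mean.

    If the baseline has zero spread, any observation that differs from the
    baseline value is flagged.
    """
    mean = statistics.fmean(baseline)
    spread = statistics.pstdev(baseline)
    if spread == 0:
        return [x for x in observations if x != mean]
    return [x for x in observations if abs(x - mean) / spread > threshold]


if __name__ == "__main__":
    # Hypothetical baseline: daily connection counts for one device.
    baseline = [10, 12, 11, 9, 10, 11, 10, 12]
    print(flag_deviations(baseline, [11, 50]))
```

A z-score assumes the baseline is roughly symmetric; for heavily skewed network measures such as connection duration, a percentile-based cutoff may be a better fit.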

Participants will be exposed to one methodology that has been applied to the preparation of large-scale data in support of ML modeling activities. When this step-wise data preparation process is integrated as pre-work for ML design, models can be run more efficiently and provide more straightforward insights into patterns associated with threats. Attendees can use this presentation to identify procedural gaps in feature extraction, gain a deeper understanding of their data, and apply these techniques in their own environments.