Understanding Network Traffic Through Intraflow Data
January 2016 • Presentation
In this presentation, the authors describe experiments to collect intraflow data from network taps, endpoints, and malware sandbox runs.
Traditional flow data gives a broad view of network communications by reporting the addresses, ports, and byte and packet counts contained in a flow. This data is valuable, but it gives little insight into the actual content of flows, especially when nonstandard destination ports are used to circumvent policy. To obtain this missing insight, we investigated intraflow data: information about events that occur inside of flows and that can be conveniently collected, stored, and analyzed within a flow monitoring framework.

Previous work in this area centers on the extraction of selected protocol fields (such as the DNS names reported by YAF). Our work focuses instead on new types of data that are independent of protocol details, such as the lengths and arrival times of messages within a flow and the distribution of byte values in the flow data. This data is especially useful in flow analytics, and it has the attractive property that it applies equally well to both encrypted and unencrypted flows. In our experiments with intraflow data, we found that it can be used to detect malicious network activity, to identify the application associated with a flow, and to better understand encrypted traffic. For example, DNS tunnels show up in this data, as do other out-of-policy tunnels and encrypted sessions on nonstandard ports.

In this presentation, we describe our experiments—including the collection of intraflow data from network taps, endpoints, and malware sandbox runs—and the analysis of that data using statistical and machine learning techniques. We present data on the cost of computation, transmission, and storage for these new data types. We also demonstrate the software tools that we constructed to gather this data, which convert a packet trace directly into a JSON-based flow description in a way that embodies both an exporter and a collector, and thus facilitates rapid prototyping and exploration of new data types.
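To make the feature types concrete, the following is a minimal sketch (not the authors' actual tool) of how protocol-independent intraflow features—message lengths, inter-arrival times, and the byte-value distribution—might be computed from a flow and emitted as JSON. The function name, input format, and feature selection here are illustrative assumptions.

```python
import json
import math
from collections import Counter

def intraflow_features(messages):
    """Compute protocol-independent intraflow features from a flow.

    `messages` is a hypothetical simplified representation of a flow:
    a list of (timestamp_seconds, payload_bytes) tuples in arrival order.
    """
    lengths = [len(payload) for _, payload in messages]
    times = [t for t, _ in messages]
    # Inter-arrival times between consecutive messages
    gaps = [round(b - a, 6) for a, b in zip(times, times[1:])]
    # Byte-value distribution across all payload bytes
    counts = Counter(b for _, payload in messages for b in payload)
    total = sum(counts.values())
    # Shannon entropy in bits per byte; values near 8 suggest
    # encrypted or compressed content regardless of the port used
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return {
        "lengths": lengths,
        "inter_arrival_times": gaps,
        "byte_entropy": round(entropy, 3),
    }

# Example: a short hypothetical flow of two messages
flow = [(0.000, b"GET / HTTP/1.1\r\n"),
        (0.042, b"HTTP/1.1 200 OK\r\n")]
print(json.dumps(intraflow_features(flow)))
```

Because none of these features depend on parsing protocol fields, the same computation applies unchanged to encrypted traffic, where only lengths, timing, and byte statistics remain observable.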
(We plan to make these tools open source, hopefully before January 2016.)