Over the past several years, John has made presentations at FloCon about incremental improvements in the SiLK paradigm for storing and analyzing network flow records. None of these presentations has made it into the mainstream as represented by the official SiLK releases, though at least one, the use of bit sets for indices, has been independently implemented and used by a number of analysts. In general, the current SiLK implementation remains tied to the technological constraints that guided its development in the early years of the 21st century. Based on informal experiments conducted at that time, the decision to work largely with flat, unindexed, and unordered hourly files, partitioned coarsely by data source and type criteria and using block compression, was the correct one.
The purpose of this talk is to revisit those decisions in light of currently available technologies, notably cloud-based data warehousing services such as Amazon's Redshift, which provide flexible, massively parallel data storage and searching mechanisms at relatively low cost. On the surface, these services appear attractive, but there are hints of problems: most are columnar stores that appear to be optimized for high-dimensional data with relatively sparse rows and columns, whereas flow data has low dimensionality and dense rows and columns. Our earlier experience with a commercial columnar data store that was not cloud based indicated that retrieving entire flow records was relatively expensive. On the other hand, the ability to transparently configure these data stores for high levels of parallel processing offers a degree of flexibility and performance tuning not easily available with current SiLK cluster implementations. It is interesting to note that there is an almost direct logical correspondence between the current SiLK storage organization and one that can be achieved with Redshift using appropriate key fields and table definitions, and this correspondence provides a starting point for the analysis.
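As a rough illustration of that correspondence, a Redshift table definition along the following lines could stand in for SiLK's hourly files partitioned by sensor and type. The table and column names here are hypothetical sketches, not the talk's actual schema, and the two-BIGINT encoding of IP addresses is one possible workaround for the lack of a native address type:

```sql
-- Hypothetical single-table layout mirroring SiLK's partitioning.
-- sensor/flowtype/stime lead the compound sort key, so Redshift's
-- per-block zone maps can skip blocks much as SiLK skips hourly files.
CREATE TABLE flows (
    sensor    VARCHAR(16)  NOT NULL,  -- SiLK sensor name
    flowtype  VARCHAR(16)  NOT NULL,  -- SiLK class/type (e.g. in, outweb)
    stime     TIMESTAMP    NOT NULL,  -- flow start time
    duration  INTEGER      NOT NULL,
    sip_hi    BIGINT       NOT NULL,  -- IPv6 source address, upper 64 bits
    sip_lo    BIGINT       NOT NULL,  -- IPv6 source address, lower 64 bits
    dip_hi    BIGINT       NOT NULL,
    dip_lo    BIGINT       NOT NULL,
    sport     INTEGER      NOT NULL,
    dport     INTEGER      NOT NULL,
    protocol  SMALLINT     NOT NULL,
    packets   BIGINT       NOT NULL,
    bytes     BIGINT       NOT NULL,
    tcpflags  SMALLINT     NOT NULL
)
DISTSTYLE EVEN                        -- spread rows evenly across slices
COMPOUND SORTKEY (sensor, flowtype, stime);
```

Splitting each 128-bit address into two 64-bit integer columns keeps range and equality predicates cheap at the cost of slightly more awkward query syntax; other encodings (fixed-width character strings, for instance) trade the other way.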
In this talk, John presents the results of experiments using a modest data set comprising on the order of a billion flow records. He begins with a discussion of the nature of the data store and its impact on data organization. Of particular interest are mechanisms for dealing with unsupported primitive data types such as IP addresses (especially IPv6 addresses) and with more complex types, such as sets and bags, that appear frequently in his analyses. Since the selection of raw data records (or portions of them) for subsequent analysis and summarization appears to be the primary bottleneck in the current system, John concentrates on the functionality of selection programs such as rwfilter in the initial comparisons. The flexibility of the cloud-based services allows the current partitioning strategies to be reconsidered, and John experiments with both single-table and multiple-table forms. The single-table form includes the current partitioning criteria, sensor and flow type, in the sort key, allowing the query optimizer to elide irrelevant portions of the store at the block level. The multiple-table form mimics the SiLK partitioning, allowing irrelevant tables to be ignored when the query is formulated.
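A hedged sketch of how an rwfilter-style selection might look against a single-table form sorted on sensor, flow type, and start time; the table and column names are illustrative assumptions rather than the talk's actual schema:

```sql
-- Roughly the selection expressed in SiLK by:
--   rwfilter --sensor=S0 --type=in \
--            --start-date=2016/01/07:00 --end-date=2016/01/07:23 \
--            --dport=443 --pass=stdout
-- Leading sort-key predicates on sensor, flowtype, and stime let the
-- query optimizer skip non-matching blocks via zone maps, much as SiLK
-- skips hourly files outside the requested sensor/type/time range.
SELECT *
FROM   flows
WHERE  sensor   = 'S0'
  AND  flowtype = 'in'
  AND  stime >= '2016-01-07 00:00:00'
  AND  stime <  '2016-01-08 00:00:00'
  AND  dport = 443;
```

In the multiple-table form, the sensor and flow-type predicates would instead select which table to name in the FROM clause, so irrelevant partitions never enter the query at all.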