Enterprise Data Storage and Analysis on Apache Spark
January 2015 • Presentation
In this presentation, Tim explores a formalized architecture utilizing Apache Spark to address data storage challenges.
Many of today's analytics solutions require data to be stored in a proprietary format and/or a proprietary data store. This approach results in an organization storing multiple copies of large datasets in a variety of formats. This presentation discusses using Apache Spark as the enterprise data repository for NetFlow and other enrichment datasets. In addition, it discusses the benefits of taking a standards-based approach to data formats, such as the open ontology for cyber data (Open Cyber Ontology Group).

Apache Spark provides, through a single unified platform, the ability to perform ad hoc analysis, batch processing, streaming analytics, and graph algorithms. Spark SQL lets you query structured data as a distributed dataset through APIs in Python, Scala, and Java, making it easy to interact with your data through a familiar SQL query language. Spark Streaming gives you the ability to write streaming applications in Java and Scala the same way you write batch-processing jobs. MLlib provides scalable machine learning capabilities for your datasets. GraphX allows you to interact with your data as a graph and apply graph algorithms, and it also allows the analyst or developer to write custom graph algorithms using the Pregel API.

Finally, this presentation explores the benefits of applying graph algorithms to flow data. We cover the concept of applying social network analysis to host communications and the potential benefits of this approach. We also discuss the benefits of enrichment data such as DNS, DHCP, blacklists, whitelists, firewall logs, incident management databases, intrusion detection system logs, passive DNS, and more.
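To make the Pregel model concrete, the following is a minimal, framework-free Python sketch of the vertex-centric iteration that GraphX's Pregel API exposes: in each superstep, every vertex that received messages combines them, updates its state if the result improves on it, and sends new messages along its out-edges; iteration stops when no messages remain. The function name and graph encoding here are illustrative, not GraphX's actual API. Single-source shortest paths is used as the example vertex program.

```python
from collections import defaultdict


def pregel_sssp(edges, source):
    """Single-source shortest paths expressed as Pregel-style supersteps.

    edges  -- list of (src, dst, weight) tuples (illustrative encoding)
    source -- the vertex to measure distances from
    """
    adj = defaultdict(list)
    vertices = set()
    for u, v, w in edges:
        adj[u].append((v, w))
        vertices.update((u, v))

    dist = {v: float("inf") for v in vertices}
    # Initial message: the source learns distance 0.
    inbox = {source: [0]}

    # Run supersteps until no vertex receives a message (convergence).
    while inbox:
        outbox = defaultdict(list)
        for v, msgs in inbox.items():
            candidate = min(msgs)        # merge incoming messages
            if candidate < dist[v]:      # vertex program: keep the minimum
                dist[v] = candidate
                for nbr, w in adj[v]:    # send updates to out-neighbors
                    outbox[nbr].append(candidate + w)
        inbox = outbox
    return dist


# Tiny example: the direct a->c edge loses to the a->b->c path.
distances = pregel_sssp([("a", "b", 1), ("b", "c", 2), ("a", "c", 5)], "a")
print(distances)  # {'a': 0, 'b': 1, 'c': 3} (key order may vary)
```

In GraphX the same three pieces appear as the vertex program, the message-sending function, and the merge function passed to `Pregel`; the loop and message routing are handled by Spark across the cluster.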
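As a rough illustration of applying social network analysis to host communications, the sketch below treats hosts as graph nodes and flow records as directed edges, then counts each host's distinct in- and out-peers. Hosts with unusually high fan-out (a scanner, for instance) or fan-in (a popular internal service) stand out immediately. This is a plain-Python assumption-laden sketch, not code from the presentation; record format and function name are hypothetical.

```python
from collections import defaultdict


def peer_degrees(flows):
    """Compute (out-degree, in-degree) per host from (src, dst) flow pairs.

    flows -- iterable of (src_host, dst_host) pairs, e.g. distilled
             from NetFlow records (hypothetical input format)
    """
    out_peers = defaultdict(set)
    in_peers = defaultdict(set)
    for src, dst in flows:
        out_peers[src].add(dst)   # distinct destinations contacted
        in_peers[dst].add(src)    # distinct sources heard from
    hosts = set(out_peers) | set(in_peers)
    return {h: (len(out_peers[h]), len(in_peers[h])) for h in hosts}


# One host contacting many peers looks like scanning behavior.
flows = [("10.0.0.5", "10.0.0.1"), ("10.0.0.5", "10.0.0.2"),
         ("10.0.0.5", "10.0.0.3"), ("10.0.0.9", "10.0.0.1")]
print(peer_degrees(flows)["10.0.0.5"])  # (3, 0)
```

On real NetFlow volumes this same per-host aggregation maps naturally onto Spark (a `reduceByKey` over flow records, or GraphX's built-in degree computations); enrichment sources such as DNS or blacklists can then annotate the high-degree hosts the analysis surfaces.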