This paper was published by the ACM in the Proceedings of the 2015 Workshop on Challenges in Performance Methods for Software Development in January 2015.
Big data systems are becoming pervasive. They are distributed systems that include redundant processing nodes, replicated storage, and frequently execute on a shared 'cloud' infrastructure. For these systems, design-time predictions are insufficient to assure runtime performance in production. This is due to the scale of the deployed system, the continually evolving workloads, and the unpredictable quality of service of the shared infrastructure. Consequently, a solution for addressing performance requirements needs sophisticated runtime observability and measurement. Observability gives real-time insights into a system's health and status, both at the system and application level, and provides historical data repositories for forensic analysis, capacity planning, and predictive analytics. Due to the scale and heterogeneity of big data systems, significant challenges exist in the design, customization and operations of observability capabilities. These challenges include economical creation and insertion of monitors into hundreds or thousands of computation and data nodes, efficient, low overhead collection and storage of measurements (which is itself a big data problem), and application-aware aggregation and visualization. In this paper we propose a reference architecture to address these challenges, which uses a model-driven engineering toolkit to generate architecture-aware monitors and application-specific visualizations.