End-to-End Declarative Data Analytics: Co-designing Engines, Interfaces, and Cloud Infrastructure
Abstract
The declarative nature of the relational model and database engines shields users from system implementation complexity, system evolution, and data representation details, and enables optimizations across workloads and use cases. However, recent trends in data management pose significant challenges for declarative interfaces. For instance, the engine might not own the data (e.g., data lakes), might have to deal with multiple data representations (e.g., CSV, Arrow, Parquet, Iceberg), and might need to support operations beyond relational SQL (e.g., vector databases, machine learning). To complicate matters, in the cloud, many layers of infrastructure stand between the engine and the compute/storage/network fabric: virtual machines (VMs), container schedulers, and general-purpose operating systems. Today's cloud infrastructure exposes only low-level, VM-centric knobs and treats analytics workloads as opaque binaries, leaving the engine with little visibility into data placement, hardware availability, and network congestion. We propose to extend the declarative approach to data management end-to-end. We start from a composite database engine and extend its declarative interfaces (SQL and query plans) down to the cloud execution layer. The database engine exposes each query as a dataflow graph composed of functions (operators), each of which declares its inputs, outputs, and available parallelism to the cloud platform. The cloud platform takes this graph as input and provides a declarative execution substrate responsible for decisions about resource scaling, data and operator placement, caching, and isolation mechanisms. Our early prototype shows how rethinking the interface between the engine and the cloud platform enables elastic, data-dependent parallel execution over data lakes and automatic caching, and it opens new research directions for cloud analytics.
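As a rough illustration of the engine-to-cloud contract sketched above, the following minimal example shows what a dataflow graph whose operators declare their inputs, outputs, and available parallelism might look like. All names and fields here are hypothetical, chosen only to make the idea concrete; they are not the prototype's actual API.

```python
# Minimal sketch (hypothetical, not the paper's API): each operator in a query's
# dataflow graph declares its inputs, outputs, and available parallelism, and the
# whole graph is handed to the cloud substrate, which decides scaling, placement,
# caching, and isolation.
from dataclasses import dataclass, field


@dataclass
class Operator:
    name: str               # e.g. "scan_orders", "agg_revenue"
    inputs: list[str]       # upstream operators or data-lake objects (e.g. Parquet files)
    outputs: list[str]      # downstream operators that consume this operator's results
    max_parallelism: int    # degree of data parallelism the engine can exploit


@dataclass
class DataflowGraph:
    query_id: str
    operators: list[Operator] = field(default_factory=list)

    def add(self, op: Operator) -> None:
        self.operators.append(op)


# Hypothetical example: a scan over data-lake Parquet files feeding an aggregation.
graph = DataflowGraph(query_id="q1")
graph.add(Operator("scan_orders",
                   inputs=["s3://lake/orders/*.parquet"],
                   outputs=["agg_revenue"],
                   max_parallelism=64))
graph.add(Operator("agg_revenue",
                   inputs=["scan_orders"],
                   outputs=["result"],
                   max_parallelism=8))

# The cloud execution substrate (not shown) would take `graph` as input and make
# the resource scaling, operator placement, caching, and isolation decisions.
```

The point of the sketch is the shape of the interface: the engine states *what* each operator needs and can parallelize over, while the platform retains the freedom to decide *how* and *where* it runs.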