Unlocking the Power of CI/CD for Data Pipelines in Distributed Data Warehouses

Authors:

Hongtao Yang, Zhichen Xu, Sergey Yudin, Andrew Davidson

Abstract

Ensuring the reliability of data pipelines is critical for modern data-driven organizations, yet building robust Continuous Integration (CI) in large, distributed data warehouses remains a significant challenge. Complexities arising from distributed ownership, the high cost of replicating production environments, and the rapid evolution of business logic lead to fragile pipelines and costly failures. This paper introduces a novel CI framework designed to conquer these challenges, achieving 94.5% pre-production issue detection in YouTube’s data warehouse while dramatically reducing resource consumption. Our key innovation lies in a production-configuration-driven testing methodology that enables scalable, isolated testing directly within the production environment. This approach reduces testing overhead and ensures high test fidelity. Furthermore, we present a lineage-aware impact analysis framework that automatically propagates data quality checks across distributed pipeline components based on an algebraic dependency model, ensuring data consistency and promoting cross-team collaboration. This production-proven solution provides a practical blueprint for CI/CD in complex, large-scale environments.

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 12
