Volume 18, No. 12

APEX-DAG: Library and Language independent Pipeline EXtraction

Authors:
Sebastian Eggers, Nina Żukowska, Ziawasch Abedjan

Abstract

Modern data-driven systems often rely on complex pipelines to process and transform data for downstream machine learning (ML) tasks. Extracting these pipelines and understanding their structure is critical for ensuring transparency, performance optimization, and maintainability, especially in large-scale projects. In this work, we introduce a novel system, APEX-DAG (Automating Pipeline EXtraction with Dataflow, Static Code Analysis, and Graph Attention Networks), which automates the extraction of data pipelines from computational notebooks or scripts. Unlike execution-based methods, APEX-DAG leverages static code analysis to identify the dataflow, transformations, and dependencies within ML workflows without executing or altering the code. Further, after an initial training phase, our system can identify pipelines that are built with previously unseen libraries.
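
As a rough illustration of what extracting dataflow without executing code can look like in general, the following minimal sketch uses Python's standard `ast` module to derive variable-to-variable edges from a short script. It is not APEX-DAG's implementation and leaves out the graph attention component entirely; the function name `extract_dataflow_edges`, the edge format, and the sample script are assumptions made only for this example.

```python
import ast

# Hypothetical sample script; it is parsed as text and never executed.
SAMPLE = '''
import pandas as pd
df = pd.read_csv("data.csv")
clean = df.dropna()
X = clean.drop(columns=["label"])
'''


def extract_dataflow_edges(source: str) -> list[tuple[str, str, str]]:
    """Return (source_var, target_var, operation) edges for simple assignments.

    Illustrative only: handles plain `name = expr` assignments and ignores
    control flow, attribute targets, tuple unpacking, and reassignment order.
    """
    tree = ast.parse(source)
    assigns = [n for n in ast.walk(tree) if isinstance(n, ast.Assign)]

    # First pass: collect every variable name that is assigned somewhere,
    # so that module aliases such as `pd` are not treated as data sources.
    defined = {
        t.id for a in assigns for t in a.targets if isinstance(t, ast.Name)
    }

    edges = []
    for a in assigns:
        targets = [t.id for t in a.targets if isinstance(t, ast.Name)]
        # Names read anywhere on the right-hand side of the assignment.
        reads = [n.id for n in ast.walk(a.value) if isinstance(n, ast.Name)]

        # Label the edge with the called function or method name, if any.
        op = "assign"
        if isinstance(a.value, ast.Call):
            func = a.value.func
            if isinstance(func, ast.Attribute):
                op = func.attr
            elif isinstance(func, ast.Name):
                op = func.id

        for target in targets:
            for read in reads:
                if read in defined:
                    edges.append((read, target, op))
    return edges


edges = extract_dataflow_edges(SAMPLE)
for src, dst, op in edges:
    print(f"{src} --{op}--> {dst}")
```

Because the edges are labeled only with the syntactic name of the call, this sketch cannot tell what an operation does semantically; that labeling step is where a learned model, as described in the abstract, would come in.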

PVLDB is part of the VLDB Endowment Inc.
