CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines

Authors:

Saeed Fathollahzadeh, Essam Mansour, Matthias Boehm


Abstract

Data-centric machine learning (ML) pipelines extend traditional ML pipelines (feature transformations, hyper-parameter tuning, and model training) with additional pre-processing steps for data cleaning, data augmentation, and feature engineering to create high-quality data with good coverage. Finding effective data-centric ML pipelines is still a labor- and compute-intensive process, however. While AutoML tools use effective search strategies, they struggle to scale to large datasets. Large language models (LLMs) show promise for code generation but face challenges in generating data-centric ML pipelines due to private datasets not seen during training, complex pre-processing requirements, and the need to mitigate hallucinations. These demands exceed those of typical code generation because they require actions tailored to the characteristics and requirements of a particular dataset. This paper introduces CatDB, a comprehensive, LLM-based system for generating effective, error-free, and efficient data-centric ML pipelines. CatDB leverages data catalog information and refined metadata to dynamically create dataset-specific rules (instructions) to guide the LLM. Moreover, CatDB includes a robust mechanism for automatic validation and error handling of the generated pipeline. Our experimental results show that CatDB reliably generates effective ML pipelines across diverse datasets, achieving accuracy comparable to or better than existing LLM-based systems, standalone AutoML tools, and combined workflows of data cleaning and AutoML tools, while delivering up to orders of magnitude faster performance on large datasets.

PVLDB is part of the VLDB Endowment Inc.


Volume 18, No. 8
