
Volume 18, No. 11

TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations

Authors:
Arash Dargahi Nobari, Davood Rafiei

Abstract

The integration of tabular data from diverse sources is often hindered by inconsistencies in formatting and representation, posing significant challenges for data analysts and personal digital assistants. Existing methods for automating tabular data transformations are limited in scope, often focusing on specific types of transformations or lacking interpretability. In this paper, we introduce TabulaX, a novel framework that leverages Large Language Models (LLMs) for multi-class column-level tabular transformations. TabulaX first classifies input columns into four transformation types (string-based, numerical, algorithmic, and general) and then applies tailored methods to generate human-interpretable transformation functions, such as numeric formulas or programming code. This approach enhances transparency and allows users to understand and modify the mappings. Through extensive experiments on real-world datasets from various domains, we demonstrate that TabulaX outperforms existing state-of-the-art approaches in terms of accuracy, supports a broader class of transformations, and generates interpretable transformations that can be efficiently applied.

Keywords:
Large Language Models, Heterogeneous Table Join, Data Integration, Data Transformation, Data Cleaning and Transformation

PVLDB is part of the VLDB Endowment Inc.
