Data Imputation with Limited Data Redundancy Using Data Lakes

Authors:

Chenyu Yang, Yuyu Luo, Chuanxuan Cui, Ju Fan, Chengliang Chai, Nan Tang

Abstract

Data imputation is essential for many data science applications. Existing methods rely heavily on sufficient data redundancy among values within the same table. However, many real-world datasets lack such redundancy, necessitating external data sources. In this paper, we introduce LakeFill, a retrieval-augmented imputation framework that combines large language models (LLMs) and data lakes to address this challenge. Unlike existing “table-level” retrieval methods designed for question answering, which retrieve data at the granularity of tables, LakeFill performs fine-grained “tuple-level” retrieval optimized specifically for data imputation. It encodes (possibly incomplete) tuples to capture nuanced similarities and differences, enabling effective identification of candidate tuples. A novel reranking method that integrates checklist-based training data annotation with stratified training group construction further refines the retrieved tuples. Finally, a reasoner with a novel two-stage confidence-aware imputation procedure ensures reliable imputation results. Extensive experiments show that LakeFill significantly outperforms state-of-the-art methods for data imputation when data redundancy is limited.
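
To make the idea of tuple-level retrieval concrete, the following is a toy sketch and not LakeFill's implementation. It assumes a data lake flattened into a list of Python dicts, replaces the learned tuple encoder with a simple Jaccard overlap over "attribute=value" tokens, omits the reranking step, and replaces the LLM reasoner with a majority vote that abstains when agreement is low. All names (serialize, similarity, retrieve, impute, min_confidence) are hypothetical and chosen only for illustration.

```python
from collections import Counter

MISSING = None  # assumption: a missing cell is represented as None


def serialize(row):
    """Turn a (possibly incomplete) tuple into a set of 'attr=value' tokens."""
    return {f"{k}={str(v).lower()}" for k, v in row.items() if v is not MISSING}


def similarity(a, b):
    """Jaccard overlap of token sets; a stand-in for a learned tuple encoder."""
    ta, tb = serialize(a), serialize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query, lake, k=3):
    """Return the k lake tuples most similar to the query tuple."""
    return sorted(lake, key=lambda t: similarity(query, t), reverse=True)[:k]


def impute(query, attr, lake, k=3, min_confidence=0.6):
    """Fill query[attr] from retrieved tuples, abstaining when the vote is weak.

    Returns (value_or_None, confidence), where confidence is the share of
    retrieved candidates that agree on the most common value.
    """
    candidates = [
        t.get(attr) for t in retrieve(query, lake, k) if t.get(attr) is not MISSING
    ]
    if not candidates:
        return None, 0.0
    value, count = Counter(candidates).most_common(1)[0]
    confidence = count / len(candidates)
    return (value if confidence >= min_confidence else None), confidence


# Tuples pooled from different tables, so schemas need not match.
lake = [
    {"name": "Alice Chen", "city": "Boston", "country": "USA"},
    {"name": "Bob Li", "city": "Boston", "country": "USA"},
    {"name": "Carla Diaz", "city": "Madrid", "country": "Spain"},
    {"city": "Boston", "state": "MA", "country": "USA"},
]
query = {"name": "Dan Wu", "city": "Boston", "country": MISSING}

result = impute(query, "country", lake)
print(result)  # ('USA', 1.0)
```

The abstention threshold is the part most loosely inspired by the confidence-aware step described above: when the retrieved tuples disagree, the sketch returns no value instead of guessing.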

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 10
