Volume 18, No. 10
Data Imputation with Limited Data Redundancy Using Data Lakes
Abstract
Data imputation is essential for many data science applications. Existing methods rely heavily on sufficient data redundancy among within-table values. However, many real-world datasets lack such redundancy, necessitating external data sources. In this paper, we introduce a retrieval-augmented imputation framework, LakeFill, which combines large language models (LLMs) and data lakes to address this challenge. Unlike existing “table-level” retrieval methods designed for question answering, which retrieve data at the granularity of tables, LakeFill performs fine-grained “tuple-level” retrieval optimized specifically for data imputation. It encodes (possibly incomplete) tuples to capture nuanced similarities and differences, enabling effective identification of candidate tuples. A novel reranking method that integrates checklist-based training data annotation with stratified training group construction further refines the retrieved tuples. Finally, a reasoner with a novel two-stage confidence-aware imputation ensures reliable imputation results. Extensive experiments show that LakeFill significantly outperforms state-of-the-art methods for data imputation when there is limited data redundancy.
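The retrieve-then-impute flow described in the abstract can be illustrated with a toy sketch. This is not LakeFill's implementation: Jaccard overlap on serialized attribute-value tokens stands in for the learned tuple encoder, the reranking stage is omitted, and a majority vote with an abstention threshold stands in for the LLM-based confidence-aware reasoner. All names (`serialize`, `retrieve`, `impute`, `min_conf`) are hypothetical.

```python
from collections import Counter


def serialize(tup):
    """Turn a (possibly incomplete) tuple into 'attr=value' tokens, skipping missing values."""
    return {f"{k}={v}".lower() for k, v in tup.items() if v is not None}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for a learned similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, lake, k=3):
    """Tuple-level retrieval: score every lake tuple against the query, keep the top k."""
    q = serialize(query)
    scored = [(jaccard(q, serialize(t)), t) for t in lake]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:k] if score > 0]


def impute(query, attr, lake, k=3, min_conf=0.6):
    """Fill query[attr] by majority vote over retrieved tuples; abstain below min_conf."""
    candidates = [t[attr] for t in retrieve(query, lake, k) if t.get(attr) is not None]
    if not candidates:
        return None, 0.0
    value, votes = Counter(candidates).most_common(1)[0]
    conf = votes / len(candidates)
    return (value if conf >= min_conf else None), conf


# Hypothetical data lake tuples and an incomplete query tuple.
lake = [
    {"name": "Louvre", "city": "Paris", "country": "France"},
    {"name": "Musee d'Orsay", "city": "Paris", "country": "France"},
    {"name": "Prado", "city": "Madrid", "country": "Spain"},
]
query = {"name": "Louvre", "city": "Paris", "country": None}
print(impute(query, "country", lake))
```

Returning `None` when agreement among retrieved tuples is low mirrors, in a very simplified way, the idea of imputing only when the evidence is reliable.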
PVLDB is part of the VLDB Endowment Inc.