Volume 18, No. 11
UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow
Abstract
Data cleaning is an essential technique to enhance data quality. Despite the proposal of various algorithms with different cleaning strategies, current automated cleaning technologies still fall short of practical requirements when dealing with large-scale data containing mixed errors. This paper presents UniClean to efficiently solve the mixed error cleaning problem with three key technical contributions. (1) A unified construction and extension method for cleaners, enabling cleaning methods to easily utilize various cleaners to perform cleaning tasks. (2) Three optimization strategies to achieve efficiency-oriented cleaning preparation. (3) A cleaning algorithm based on an optimized cleaning process to effectively clean mixed errors. UniClean achieves a time complexity of $O(|D_{error}|^{4} \cdot |Op| + |D| \cdot |D_{error}|)$, significantly enhancing scalability. Experiments on public and large-scale enterprise datasets demonstrate that UniClean achieves over 40% improvement across five metrics, compared to five state-of-the-art cleaning methods, and delivers more than 30% gains in F1 and REDR on complex datasets, while completing the cleaning process within hours even for millions of records.
PVLDB is part of the VLDB Endowment Inc.