UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow

Authors:

Xiaoou Ding, Zekai Qian, Hongzhi Wang, Siying Chen, Yafeng Tang, Hongbin Su, Huan Hu, Chen Wang

Abstract

Data cleaning is an essential technique to enhance data quality. Despite the proposal of various algorithms with different cleaning strategies, current automated cleaning technologies still fall short of practical requirements when dealing with large-scale data containing mixed errors. This paper presents UniClean to efficiently solve the mixed error cleaning problem with three key technical contributions. (1) A unified construction and extension method for cleaners, enabling cleaning methods to easily utilize various cleaners to perform cleaning tasks. (2) Three optimization strategies to achieve efficiency-oriented cleaning preparation. (3) A cleaning algorithm based on an optimized cleaning process to effectively clean mixed errors. UniClean achieves a time complexity of O(|D_error|^4 · |Op| + |D| · |D_error|), significantly enhancing scalability. Experiments on public and large-scale enterprise datasets demonstrate that UniClean achieves over 40% improvement across five metrics, compared to five state-of-the-art cleaning methods, and delivers more than 30% gains in F1 and REDR on complex datasets, while completing the cleaning process within hours even for millions of records.

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 11
