DemandClean: A Multi-Objective Learning Framework for Balancing Model Tolerance to Data Authenticity and Diversity

Authors:

Zekai Qian, Xiaoou Ding, Chen Wang, Hongzhi Wang

Abstract

Real-world datasets often suffer from multiple quality issues, hindering downstream model performance and increasing cleaning costs. To address this, we propose DemandClean, a reinforcement learning-based adaptive data cleaning framework that dynamically balances cleaning effectiveness and operational costs. DemandClean explicitly considers data authenticity (alignment with real-world facts), diversity (richness of feature values), and downstream models’ noise tolerance. We categorize data errors as missing (reducing authenticity and diversity), semantic (affecting only authenticity), and syntactic (affecting authenticity but potentially increasing diversity). Based on these errors, DemandClean intelligently selects among Repair, Delete, or No actions, guided by error rates and model robustness. For interpretability, the framework visually distinguishes authenticity, diversity, and tolerance. Extensive experiments confirm that DemandClean achieves near-optimal accuracy at substantially reduced preprocessing costs. Specifically, it reduces repair actions by 80.0% and deletions by 80.7% compared to “Repair All” strategies, while maintaining or even exceeding their predictive performance, thus offering an interpretable, cost-effective, and scalable solution for practical applications.
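The error taxonomy and the three-way action choice described in the abstract can be illustrated with a small sketch. This is not the authors' method: DemandClean learns its policy with reinforcement learning, whereas the hand-written rule below is only a hypothetical stand-in. The names `EFFECTS` and `choose_action`, the cost parameters, and the threshold logic are all assumptions made for illustration; only the three error types, their qualitative effects, and the three actions come from the abstract.

```python
from enum import Enum


class ErrorType(Enum):
    MISSING = "missing"
    SEMANTIC = "semantic"
    SYNTACTIC = "syntactic"


class Action(Enum):
    REPAIR = "repair"
    DELETE = "delete"
    NO_ACTION = "no_action"


# Qualitative effects as stated in the abstract:
# -1 = reduces, 0 = no effect stated, +1 = may increase.
EFFECTS = {
    ErrorType.MISSING: {"authenticity": -1, "diversity": -1},
    ErrorType.SEMANTIC: {"authenticity": -1, "diversity": 0},
    ErrorType.SYNTACTIC: {"authenticity": -1, "diversity": +1},
}


def choose_action(error_type, error_rate, tolerance,
                  repair_cost=1.0, delete_cost=0.3):
    """Toy rule standing in for a learned policy (hypothetical, not the paper's).

    error_rate and tolerance are fractions in [0, 1]; the costs are
    arbitrary illustrative units.
    """
    # Assumption: a model tolerates noise up to `tolerance`, so cleaning
    # below that level is not worth its cost.
    if error_rate <= tolerance:
        return Action.NO_ACTION
    # Assumption: a record whose error already reduces diversity has less
    # to lose from deletion, so delete it when that is cheaper than repair.
    if EFFECTS[error_type]["diversity"] < 0 and delete_cost < repair_cost:
        return Action.DELETE
    return Action.REPAIR


if __name__ == "__main__":
    for etype in ErrorType:
        action = choose_action(etype, error_rate=0.30, tolerance=0.10)
        print(etype.value, action.value)
```

The rule makes the trade-off explicit: tolerance gates whether any cleaning happens, and cost plus the error's effect on diversity decides between Repair and Delete. A learned policy would replace these fixed thresholds with decisions driven by observed downstream performance.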

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 12
