Volume 18, No. 12
DemandClean: A Multi-Objective Learning Framework for Balancing Model Tolerance to Data Authenticity and Diversity
Abstract
Real-world datasets often suffer from multiple quality issues, hindering downstream model performance and increasing cleaning costs. To address this, we propose DemandClean, a reinforcement learning-based adaptive data cleaning framework that dynamically balances cleaning effectiveness and operational costs. DemandClean explicitly considers data authenticity (alignment with real-world facts), diversity (richness of feature values), and downstream models’ noise tolerance. We categorize data errors as missing (reducing authenticity and diversity), semantic (affecting only authenticity), and syntactic (affecting authenticity but potentially increasing diversity). Based on these errors, DemandClean intelligently selects among Repair, Delete, or No actions, guided by error rates and model robustness. For interpretability, the framework visually distinguishes authenticity, diversity, and tolerance. Extensive experiments confirm that DemandClean achieves near-optimal accuracy at substantially reduced preprocessing costs. Specifically, it reduces repair actions by 80.0% and deletions by 80.7% compared to “Repair All” strategies, while maintaining or even exceeding their predictive performance, thus offering an interpretable, cost-effective, and scalable solution for practical applications.
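The error taxonomy and the accuracy-versus-cost trade-off described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Action`, `ErrorEffect`, `ERROR_EFFECTS`, and `reward`, the per-action costs, and the `cost_weight` parameter are all hypothetical, chosen only to make the idea concrete. The effect signs are the only part taken from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three cleaning actions named in the abstract."""
    REPAIR = "repair"
    DELETE = "delete"
    NO_ACTION = "no_action"


@dataclass(frozen=True)
class ErrorEffect:
    """Direction of an error type's effect: -1 reduces, 0 none, +1 may increase."""
    authenticity: int
    diversity: int


# Taxonomy as stated in the abstract.
ERROR_EFFECTS = {
    "missing": ErrorEffect(authenticity=-1, diversity=-1),
    "semantic": ErrorEffect(authenticity=-1, diversity=0),
    "syntactic": ErrorEffect(authenticity=-1, diversity=+1),
}

# Assumed relative costs per action (illustrative values, not from the paper).
ACTION_COSTS = {Action.REPAIR: 1.0, Action.DELETE: 0.5, Action.NO_ACTION: 0.0}


def reward(accuracy, actions, cost_weight=0.1):
    """Toy scalar reward: downstream accuracy minus weighted mean action cost.

    This is an assumed reward shape for illustration; the abstract does not
    specify how the framework combines accuracy and cost.
    """
    if not actions:
        return accuracy
    mean_cost = sum(ACTION_COSTS[a] for a in actions) / len(actions)
    return accuracy - cost_weight * mean_cost


if __name__ == "__main__":
    repair_all = [Action.REPAIR] * 4
    selective = [Action.REPAIR, Action.NO_ACTION, Action.NO_ACTION, Action.NO_ACTION]
    # At equal accuracy, the selective policy scores higher because it costs less.
    print(reward(0.9, repair_all), reward(0.9, selective))
```

Under this assumed reward, a policy that matches the accuracy of "Repair All" while taking fewer repair actions is preferred, which mirrors the trade-off the abstract reports.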
PVLDB is part of the VLDB Endowment Inc.