CIDR Proceedings

Semi-Supervised Data Cleaning with Raha and Baran

Authors:

Mohammad Mahdavi, Ziawasch Abedjan

Abstract

Data cleaning is a tedious data preparation task, which typically needs user supervision in the form of predefined configurations, such as rules, parameters, or patterns. We have recently developed two configuration-free systems, Raha and Baran, to detect and correct data errors in a semi-supervised manner. In this paper, we demonstrate how both systems can be used within an end-to-end data cleaning pipeline. Our demonstration shows how user supervision can be reduced to a negligible amount of example corrections using effective feature representation, label propagation, and transfer learning methods. While each cleaning step, detection and correction, faces substantially different challenges, we have designed the corresponding systems based on the same intuition. Both systems internally leverage an automatically generatable set of base detectors and correctors and learn to combine them using a few user labels. In practice, with only 20 user-annotated tuples, it is possible to effectively identify and fix data quality problems inside a dataset. Furthermore, both systems benefit from knowledge of prior cleaning tasks. Using transfer learning, both systems can optimize the data cleaning task at hand in terms of error detection runtime and error correction effectiveness.
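
The idea of combining base detectors with a few user labels can be illustrated with a toy sketch. This is not the code of Raha or Baran. The three detectors, the example column, and the function names `featurize` and `propagate_labels` are all hypothetical, and grouping cells by identical feature vectors is a deliberately simple stand-in for whatever clustering and learning the actual systems use.

```python
from collections import defaultdict

# Hypothetical base detectors: each returns True if a cell looks suspicious.
def is_missing(value):
    return value.strip() == "" or value.strip().upper() in {"N/A", "NULL"}

def has_digit(value):
    return any(ch.isdigit() for ch in value)

def is_rare(value, column):
    return column.count(value) == 1

def featurize(column):
    """Represent each cell as a tuple of base-detector outputs."""
    return [(is_missing(v), has_digit(v), is_rare(v, column)) for v in column]

def propagate_labels(features, user_labels):
    """Group cells with identical feature vectors and spread the user labels
    of each group (by majority vote) to its unlabeled members."""
    groups = defaultdict(list)
    for idx, vec in enumerate(features):
        groups[vec].append(idx)

    predictions = {}
    for members in groups.values():
        votes = [user_labels[i] for i in members if i in user_labels]
        if not votes:
            continue  # no user label reached this group; leave it undecided
        is_error = sum(votes) * 2 > len(votes)
        for i in members:
            # A cell the user labeled keeps the user's label.
            predictions[i] = user_labels.get(i, is_error)
    return predictions

# Hypothetical column with a few dirty cells.
cities = ["Berlin", "Berlin", "Paris", "Paris", "Par1s",
          "", "Rome", "Rome", "N/A", "Berl1n"]

# The user labels only three cells: index -> True if the cell is an error.
user_labels = {0: False, 4: True, 5: True}

predicted = propagate_labels(featurize(cities), user_labels)
flagged = sorted(i for i, is_error in predicted.items() if is_error)
print(flagged)
```

In this sketch, three labels are enough to decide all ten cells, because every cell shares its detector signature with one labeled cell. On real data some groups would receive no label and stay undecided, which is where a learned classifier over the same features would take over.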