
Volume 18, No. 12

RadlER: Deduplicated Sampling On-Demand

Authors:
Luca Zecchini, Ziawasch Abedjan, Vasilis Efthymiou, Giovanni Simonini

Abstract

Data practitioners often need to sample their datasets to produce representative subsets for their downstream tasks. Unfortunately, real-world datasets frequently contain duplicates, whose presence biases sampling and degrades the quality of the produced subsets, and hence the outcome of downstream tasks. While deduplication is therefore fundamental, performing it on the entire dataset in order to sample from its cleaned version might be prohibitively expensive in terms of time and resources. Thus, we recently introduced RadlER, a solution to perform deduplicated sampling on-demand, i.e., to produce a clean sample of a dirty dataset incrementally, according to a target distribution of some subpopulations, by focusing the cleaning effort only on the entities required to appear in the sample. In this demonstration, we interactively show how RadlER can support practitioners in their data science pipelines, allowing them to save a significant amount of time and resources.

PVLDB is part of the VLDB Endowment Inc.
