Volume 18, No. 8

Evaluating Methods for Efficient Entity Count Estimation

Authors:
Jerin George Mathew, Donatella Firmani, Divesh Srivastava

Abstract

The problem of estimating the size of a query result has a long history in data management. When the query performs entity resolution (also known as record linkage or deduplication), the problem becomes that of estimating the number of distinct entities, referred to as the entity count. This problem has received attention from the statistics community, but it has been largely overlooked in the data management literature. In this work, we formally define the entity count problem from a data management perspective and decompose it into a framework of fundamental steps. We explore approaches from both statistics and data management, systematically identifying a design space for different pipelines that address this problem. Finally, we provide extensive experiments to highlight the strengths and weaknesses of these approaches on real-world benchmarks.
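
To make the quantity being estimated concrete, the sketch below computes an exact entity count for a toy dataset. It assumes that all pairwise match decisions are already known and that matching is transitive, so the entity count equals the number of connected components of the match graph. This is only an illustration of the target quantity, not one of the estimation pipelines evaluated in the paper; the function name `entity_count` and the record identifiers are hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def entity_count(records: Iterable[Hashable],
                 matches: Iterable[Tuple[Hashable, Hashable]]) -> int:
    """Count distinct entities as connected components of the match graph.

    Assumption (not from the paper): `matches` lists every pair of records
    judged to refer to the same entity, and matching is transitive.
    Every record named in `matches` must also appear in `records`.
    """
    # Each record starts as its own entity (its own union-find root).
    parent = {r: r for r in records}

    def find(x: Hashable) -> Hashable:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    count = len(parent)
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merging two distinct components removes one entity.
            parent[root_a] = root_b
            count -= 1
    return count


# Toy data: r1, r2, r3 describe one entity; r4 and r5 are singletons.
records = ["r1", "r2", "r3", "r4", "r5"]
matches = [("r1", "r2"), ("r2", "r3")]
print(entity_count(records, matches))  # prints 3
```

Computing this value exactly requires resolving every record, which is the cost that estimation approaches aim to avoid.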

PVLDB is part of the VLDB Endowment Inc.