go back

Volume 18, No. 4

Discovering Approximate Inclusion Dependencies

Authors:
Qingdong Su, Zhikang Wang, Zijing Tan, Shuai Ma

Abstract

Inclusion dependencies ( INDs ) are widely used in data management tasks. The discovery techniques of INDs have thus received a lot of attention, for discovering INDs valid in data. However, real-world data quality issues may lead to partial violations of INDs . This paper makes the first effort to provide a comprehensive study on the discovery of approximate INDs ( AINDs ), aiming to identify INDs with error rates below a given threshold. This paper introduces a new definition of AIND based on deletion semantics, in addition to the existing definition based on insertion semantics. A discovery method is developed that can be configured to identify AINDs based on either of these semantics. The method combines partitioning techniques to handle tables that cannot all fit into memory simultaneously, with novel approaches to quantify AIND violations based on partitioned tables. To improve efficiency, the method employs a novel three-layer filtering structure and techniques that can potentially prune invalid candidate AINDs and identify valid AINDs without necessarily processing all tuples. We conduct an extensive experimental evaluation and verify the following: the proposed method significantly outperforms existing methods for AIND discovery based on insertion semantics, the AIND discoveries with insertion and deletion semantics can provide complementary results, and our discovery method can effectively deal with dirty dataset containing various types of errors.

PVLDB is part of the VLDB Endowment Inc.

Privacy Policy