ACM SIGMOD Anthology VLDB dblp.uni-trier.de

Sampling-Based Estimation of the Number of Distinct Values of an Attribute.

Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes: Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB 1995: 311-322
@inproceedings{DBLP:conf/vldb/HaasNSS95,
  author    = {Peter J. Haas and
               Jeffrey F. Naughton and
               S. Seshadri and
               Lynne Stokes},
  editor    = {Umeshwar Dayal and
               Peter M. D. Gray and
               Shojiro Nishio},
  title     = {Sampling-Based Estimation of the Number of Distinct Values of
               an Attribute},
  booktitle = {VLDB'95, Proceedings of 21th International Conference on Very
               Large Data Bases, September 11-15, 1995, Zurich, Switzerland},
  publisher = {Morgan Kaufmann},
  year      = {1995},
  isbn      = {1-55860-379-4},
  pages     = {311-322},
  ee        = {db/conf/vldb/HaasNSS95.html},
  crossref  = {DBLP:conf/vldb/95},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

We provide several new sampling-based estimators of the number of distinctvalues of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly- skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new "hybrid" estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new "smoothed jackknife" estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size ofthe database.

Copyright © 1995 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.


Online Paper

ACM SIGMOD Anthology

CDROM Version: Load the CDROM "Volume 1 Issue 5, VLDB '89-'97" and ... DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Printed Edition

Umeshwar Dayal, Peter M. D. Gray, Shojiro Nishio (Eds.): VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland. Morgan Kaufmann 1995, ISBN 1-55860-379-4
Contents CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML

References

[AS72]
...
[ASW87]
Morton M. Astrahan, Mario Schkolnick, Kyu-Young Whang: Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12(1): 11-15(1987) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[BFi93]
...
[BO78]
...
[BO79]
...
[BFe93]
Quentin L. Burrell, Michael R. Fenton: Yes, the GIGP Really Does Work - and Is Workable! JASIS 44(2): 61-69(1993) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[Cha84]
...
[CL92]
...
[ET93]
...
[FM85]
Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209(1985) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[GG82]
Erol Gelenbe, Danièle Gardy: On the Size of Projections: I. Inf. Process. Lett. 14(1): 18-21(1982) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[Goo49]
...
[HN+93]
Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Arun N. Swami: Selectivity and Cost Estimation for Joins Based on Random Sampling. J. Comput. Syst. Sci. 52(3): 550-569(1996) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[HS94]
Joseph M. Hellerstein, Michael Stonebraker: Predicate Migration: Optimizing Queries with Expensive Predicates. SIGMOD Conference 1993: 267-276 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[HF83]
...
[HOT88]
Wen-Chi Hou, Gultekin Özsoyoglu, Baldeo K. Taneja: Statistical Estimators for Relational Algebra Expressions. PODS 1988: 276-287 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[HOT89]
Wen-Chi Hou, Gultekin Özsoyoglu, Baldeo K. Taneja: Processing Aggregate Relational Queries with Hard Time Constraints. SIGMOD Conference 1989: 68-77 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[LP56]
...
[NS90]
Jeffrey F. Naughton, S. Seshadri: On Estimating the Size of Projections. ICDT 1990: 499-513 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[Olk93]
...
[ODT91]
Gultekin Özsoyoglu, Kaizheng Du, A. Tjahjana, Wen-Chi Hou, D. Y. Rowland: On Estimating COUNT, SUM, and AVERAGE. DEXA 1991: 406-412 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[SSW92]
...
[SAC+79]
Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979: 23-34 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[Shl81]
...
[Si86a]
...
[Si86b]
...
[Si92]
H. S. Sichel: Anatomy of the Generalized Inverse Gaussian-Poisson Distribution with Special Applications to Bibliometric Studies. Inf. Process. Manage. 28(1): 5-18(1992) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[SvB84]
...
[WVT90]
Kyu-Young Whang, Brad T. Vander Zanden, Howard M. Taylor: A Linear-Time Probabilistic Counting Algorithm for Database Applications. ACM Trans. Database Syst. 15(2): 208-229(1990) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML

Copyright © Fri Mar 12 17:22:54 2010 by Michael Ley (ley@uni-trier.de)