Duplicate Removal in Information System Dissemination.
Tak W. Yan, Hector Garcia-Molina:
Duplicate Removal in Information System Dissemination.
VLDB 1995: 66-77
@inproceedings{DBLP:conf/vldb/YanG95,
author = {Tak W. Yan and
Hector Garcia-Molina},
editor = {Umeshwar Dayal and
Peter M. D. Gray and
Shojiro Nishio},
title = {Duplicate Removal in Information System Dissemination},
booktitle = {VLDB'95, Proceedings of 21th International Conference on Very
Large Data Bases, September 11-15, 1995, Zurich, Switzerland},
publisher = {Morgan Kaufmann},
year = {1995},
isbn = {1-55860-379-4},
pages = {66-77},
ee = {db/conf/vldb/YanG95.html},
crossref = {DBLP:conf/vldb/95},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
Abstract
Our experience with the SIFT [YGM95] information dissemination system (in use by over 7,000 users daily) has identified an important and generic dissemination problem: duplicate information.
In this paper we explain why duplicates arise, we quantify the problem, and we discuss why it impairs information dissemination.
We then propose a Duplicate Removal Module (DRM) for an information dissemination system.
The removal of duplicates operates on a per user, per document basis - each document read by a user generates a request, or a duplicate restraint.
In wide-area environments, the number of restraints handled is very large.
We consider the implementation of a DRM, examining alternative algorithms and data structures that may be used.
We present a performance evaluation of the alternatives and answer important design questions such as: Which implementation is the best?
With the "best" scheme, how expensive will duplicate removal be?
How much memory is required? How fast can restraints be processed?
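As a rough illustration of the per-user, per-document restraint idea described above, the following is a minimal sketch and not the paper's implementation. It assumes that each document can be identified by a string key and that restraints fit in memory as one hash set per user. The class name `DuplicateRemovalModule` and its methods are hypothetical; the algorithms and data structures actually evaluated in the paper are not given in this record.

```python
from collections import defaultdict


class DuplicateRemovalModule:
    """Hypothetical in-memory DRM: one set of document keys per user."""

    def __init__(self):
        # user id -> set of document keys that user has already read
        self._restraints = defaultdict(set)

    def add_restraint(self, user, doc_key):
        """Record that `user` has read the document identified by `doc_key`."""
        self._restraints[user].add(doc_key)

    def is_duplicate(self, user, doc_key):
        # .get avoids creating an empty set for users with no restraints
        return doc_key in self._restraints.get(user, ())

    def filter(self, user, doc_keys):
        """Return the keys from `doc_keys` that `user` has not yet read."""
        return [k for k in doc_keys if not self.is_duplicate(user, k)]

    def restraint_count(self):
        """Total number of stored restraints across all users."""
        return sum(len(keys) for keys in self._restraints.values())


drm = DuplicateRemovalModule()
drm.add_restraint("alice", "doc-17")
print(drm.filter("alice", ["doc-17", "doc-42"]))  # ['doc-42']
print(drm.filter("bob", ["doc-17", "doc-42"]))    # ['doc-17', 'doc-42']
```

Because one restraint is stored per user per document read, memory in this sketch grows with the total number of reads, which is the cost the abstract's questions about memory and processing speed are concerned with.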
Copyright © 1995 by the VLDB Endowment.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the VLDB
copyright notice and the title of the publication and
its date appear, and notice is given that copying
is by the permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires
a fee and/or special permission from the Endowment.
Printed Edition
Umeshwar Dayal, Peter M. D. Gray, Shojiro Nishio (Eds.):
VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland.
Morgan Kaufmann 1995, ISBN 1-55860-379-4
References
- [BDGM95]
- Sergey Brin, James Davis, Hector Garcia-Molina:
Copy Detection Mechanisms for Digital Documents.
SIGMOD Conference 1995: 398-409
- [BLCGP92]
- Tim Berners-Lee, Robert Cailliau, Jean-François Groff, Bernd Pollermann:
World-Wide Web: The Information Universe.
Electronic Networking: Research, Applications and Policy 1(2): 74-82(1992)
- [Coh92]
- ...
- [Goy87]
- Pankaj Goyal:
Duplicate record identification in bibliographic databases.
Inf. Syst. 12(3): 239-242(1987)
- [HR79]
- ...
- [Kro92]
- ...
- [LT92]
- Shoshana Loeb, Douglas B. Terry:
Information Filtering - Preface to the Special Section.
Commun. ACM 35(12): 26-28(1992)
- [ORO93]
- ...
- [Rei93]
- ...
- [Rid92]
- ...
- [Sal68]
- ...
- [SGM95]
- Narayanan Shivakumar, Hector Garcia-Molina:
SCAM: A Copy Detection Mechanism for Digital Documents.
DL 1995: 0-
- [YGM94a]
- Tak W. Yan, Hector Garcia-Molina:
Index Structures for Information Filtering Under the Vector Space Model.
ICDE 1994: 337-347
- [YGM94b]
- Tak W. Yan, Hector Garcia-Molina:
Index Structures for Selective Dissemination of Information Under the Boolean Model.
ACM Trans. Database Syst. 19(2): 332-364(1994)
- [YGM95]
- Tak W. Yan, Hector Garcia-Molina:
SIFT - a Tool for Wide-Area Information Dissemination.
USENIX Winter 1995: 177-186
Copyright © Mon Mar 15 03:55:55 2010
by Michael Ley (ley@uni-trier.de)