Determining Text Databases to Search in the Internet.

Weiyi Meng, King-Lup Liu, Clement T. Yu, Xiaodong Wang, Yuhsi Chang, Naphtali Rishe: Determining Text Databases to Search in the Internet. VLDB 1998: 14-25

@inproceedings{DBLP:conf/vldb/MengLYWCR98,
  author    = {Weiyi Meng and
               King-Lup Liu and
               Clement T. Yu and
               Xiaodong Wang and
               Yuhsi Chang and
               Naphtali Rishe},
  editor    = {Ashish Gupta and
               Oded Shmueli and
               Jennifer Widom},
  title     = {Determining Text Databases to Search in the Internet},
  booktitle = {VLDB'98, Proceedings of 24th International Conference on Very
               Large Data Bases, August 24-27, 1998, New York City, New York,
               USA},
  publisher = {Morgan Kaufmann},
  year      = {1998},
  isbn      = {1-55860-566-5},
  pages     = {14-25},
  ee        = {db/conf/vldb/MengLYWCR98.html},
  crossref  = {DBLP:conf/vldb/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Text data in the Internet can be partitioned into many databases naturally. Efficient retrieval of desired data can be achieved if we can accurately predict the usefulness of each database, because with such information, we only need to retrieve potentially useful documents from useful databases. In this paper, we propose two new methods for estimating the usefulness of text databases. For a given query, the usefulness of a text database in this paper is defined to be the number of documents in the database that are sufficiently similar to the query. Such a usefulness measure enables naive users to make informed decisions about which databases to search. We also consider the collection fusion problem. Because local databases may employ similarity functions that are different from that used by the global database, the threshold used by a local database to determine whether a document is potentially useful may be different from that used by the global database. We provide techniques that determine the best threshold for a given local database.
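The usefulness definition in the abstract can be illustrated with a small sketch. The code below counts, by brute force, the documents in each database whose similarity to the query exceeds a threshold. It is not one of the two estimation methods the paper proposes, which are not described on this page. The cosine similarity over raw term frequencies, the strict ">" comparison, the threshold values, the function names, and the toy databases are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine similarity between two bags of words (raw term frequencies).

    Assumption: the abstract does not name a similarity function; cosine
    is used here only as a common illustrative choice.
    """
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def usefulness(query, database, threshold):
    """Number of documents whose similarity to the query exceeds threshold."""
    q_terms = query.lower().split()
    return sum(
        1
        for doc in database
        if cosine_similarity(q_terms, doc.lower().split()) > threshold
    )


# Hypothetical toy databases, each a list of short documents.
databases = {
    "db_a": [
        "text database selection",
        "searching a text database",
        "cooking pasta at home",
    ],
    "db_b": ["gardening tips", "text retrieval"],
}

query = "text database"

# Rank databases by exact usefulness at an assumed threshold of 0.4.
ranking = sorted(
    databases,
    key=lambda name: usefulness(query, databases[name], 0.4),
    reverse=True,
)
```

With these toy inputs, a user would search `db_a` first. Raising the threshold makes the measure stricter, which is why a threshold chosen for one similarity function may not carry over to a database that uses another.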

Copyright © 1998 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Online Paper

Download PDF file (www.vldb.org, Darmstadt, Germany)
Download PDF file (www.acm.org, New York, USA)

ACM SIGMOD DiSC

CDROM Version: DiSC, Volume 1 Number 1

ACM SIGMOD Anthology

DVD Version: ACM SIGMOD Anthology DVD 1

Printed Edition

Ashish Gupta, Oded Shmueli, Jennifer Widom (Eds.): VLDB'98, Proceedings of 24th International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA. Morgan Kaufmann 1998, ISBN 1-55860-566-5

References

[ALSF97]: ...
[BuSA93]: ...
[CLBC95]: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
[DuHa73]: ...
[Gass69]: ...
[GrGM95a]: Luis Gravano, Hector Garcia-Molina: Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. VLDB 1995: 78-89
[GrGM95b]: ...
[GrGM97]: Luis Gravano, Hector Garcia-Molina: Merging Ranks from Heterogeneous Internet Sources. VLDB 1997: 196-205
[Harm93]: ...
[HoDr97]: ...
[KaMe91]: ...
[Kost94]: Martijn Koster: ALIWEB - Archie-like Indexing in the WEB. Computer Networks and ISDN Systems 27(2): 175-182 (1994)
[Kow97]: ...
[LaYu82]: K. Lam, Clement T. Yu: A Clustered Search Algorithm Incorporating Arbitrary Term Dependencies. ACM Trans. Database Syst. 7(3): 500-508 (1982)
[MaBi97]: ...
[MLYW98]: ...
[NCS]: ...
[SaMc83]: Gerard Salton, Michael McGill: Introduction to Modern Information Retrieval. McGraw-Hill Book Company 1984, ISBN 0-07-054484-0
[Salt89]: Gerard Salton: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley 1989, ISBN 0-201-12227-8
[SeEt95]: ...
[SeEt97]: ...
[TVGJ95]: ...
[VGJL95]: Ellen M. Voorhees, Narendra Kumar Gupta, Ben Johnson-Laird: Learning Collection Fusion Strategies. SIGIR 1995: 172-179
[Widd89]: ...
[YaGM95]: Tak W. Yan, Hector Garcia-Molina: SIFT - a Tool for Wide-Area Information Dissemination. USENIX Winter 1995: 177-186
[YuLS78]: Clement T. Yu, W. S. Luk, M. K. Siu: On the Estimation of the Number of Desired Records with Respect to a Given Query. ACM Trans. Database Syst. 3(1): 41-56 (1978)
[YuLe97]: Budi Yuwono, Dik Lun Lee: Server Ranking for Distributed Text Retrieval Systems on the Internet. DASFAA 1997: 41-50