Distributed Hypertext Resource Discovery Through Examples.
Soumen Chakrabarti, Martin van den Berg, Byron Dom:
Distributed Hypertext Resource Discovery Through Examples.
VLDB 1999: 375-386
@inproceedings{DBLP:conf/vldb/ChakrabartiBD99,
author = {Soumen Chakrabarti and
Martin van den Berg and
Byron Dom},
editor = {Malcolm P. Atkinson and
Maria E. Orlowska and
Patrick Valduriez and
Stanley B. Zdonik and
Michael L. Brodie},
title = {Distributed Hypertext Resource Discovery Through Examples},
booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK},
publisher = {Morgan Kaufmann},
year = {1999},
isbn = {1-55860-615-7},
pages = {375-386},
ee = {db/conf/vldb/ChakrabartiBD99.html},
crossref = {DBLP:conf/vldb/99},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
We describe the architecture of a hypertext resource discovery system using a
relational database. Such a system can answer questions that combine page contents,
meta-data, and hyperlink structure in powerful ways, such as "find the number of links
from an environmental protection page to a page about oil and natural gas over the
last year." A key problem in populating the database in such a system is to discover
web resources related to the topics involved in such queries. We argue that a
keyword-based "find similar" search based on a giant all-purpose crawler is neither
necessary nor adequate for resource discovery. Instead, we exploit two properties:
pages tend to cite pages on related topics, and if a page u cites a page about a
desired topic, u is very likely to cite additional desirable pages. We exploit these
properties by using a crawler controlled by two hypertext mining programs: (1) a
classifier that evaluates the relevance of a region of the web to the user's interest,
and (2) a distiller that evaluates a page as an access point for a large neighborhood
of relevant pages. Our implementation uses IBM's Universal Database, not only for
robust data storage, but also for integrating the computations of the classifier and
distiller into the database. This results in a significant increase in I/O efficiency:
a factor of ten for the classifier and a factor of three for the distiller. In
addition, ad-hoc SQL queries can be used to monitor the crawler and dynamically change
crawling strategies. We report on experiments to establish that
our system is efficient, effective, and robust.
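
The crawling idea in the abstract (follow links from pages the classifier judges relevant before links from irrelevant ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy link graph, the keyword-fraction `relevance` function standing in for the classifier, and the names `TOY_WEB`, `TOPIC_TERMS`, and `focused_crawl` are all assumptions made for illustration, and the distiller and the database integration are left out.

```python
import heapq

# Hypothetical toy web: page id -> (page text, outgoing links).
TOY_WEB = {
    "a": ("oil gas drilling", ["b", "c"]),
    "b": ("natural gas pipeline oil", ["d"]),
    "c": ("football scores", ["e"]),
    "d": ("oil refinery gas", []),
    "e": ("basketball scores", []),
}

# Hypothetical topic description; a real classifier would be trained on examples.
TOPIC_TERMS = {"oil", "gas"}


def relevance(text):
    """Stand-in for the classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, budget):
    """Best-first crawl: a link's priority is the relevance of the page citing it."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        text, links = TOY_WEB[page]
        score = relevance(text)
        visited.append((page, score))
        for target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


if __name__ == "__main__":
    for page, score in focused_crawl("a", budget=4):
        print(page, round(score, 2))
```

With a budget of four fetches, the page "e", which is reachable only through the off-topic page "c", is never fetched, because links found on relevant pages are expanded first.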
Copyright © 1999 by the VLDB Endowment.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the VLDB
copyright notice and the title of the publication and
its date appear, and notice is given that copying
is by the permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires
a fee and/or special permission from the Endowment.
Printed Edition
Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.):
VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK.
Morgan Kaufmann 1999, ISBN 1-55860-615-7