Similarity Search in High Dimensions via Hashing.

Aristides Gionis, Piotr Indyk, Rajeev Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999: 518-529

@inproceedings{DBLP:conf/vldb/GionisIM99,
  author    = {Aristides Gionis and
               Piotr Indyk and
               Rajeev Motwani},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Similarity Search in High Dimensions via Hashing},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {518-529},
  ee        = {db/conf/vldb/GionisIM99.html},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Online Paper

Download PDF file (www.vldb.org, Darmstadt, Germany)
Download PDF file (www.acm.org, New York, USA)

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7
Contents

References

[1]: ...
[2]: Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, Angela Y. Wu: An Optimal Algorithm for Approximate Nearest Neighbor Searching. SODA 1994: 573-582
[3]: Jon Louis Bentley: Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18(9): 509-517(1975)
[4]: Stefan Berchtold, Daniel A. Keim: High-Dimensional Index Structures, Database Support for Next Decade's Applications (Tutorial). SIGMOD Conference 1998: 501
[5]: ...
[6]: Timothy M. Chan: Approximate Nearest Neighbor Queries Revisited. Symposium on Computational Geometry 1997: 352-358
[7]: R. Scott Cost, Steven Salzberg: A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning 10: 57-78(1993)
[8]: Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya: Random Sampling for Histogram Construction: How much is enough? SIGMOD Conference 1998: 436-447
[9]: ...
[10]: Paolo Ciaccia, Marco Patella, Pavel Zezula: A Cost Model for Similarity Queries in Metric Spaces. PODS 1998: 59-68
[11]: Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, Richard A. Harshman: Indexing by Latent Semantic Analysis. JASIS 41(6): 391-407(1990)
[12]: ...
[13]: ...
[14]: ...
[15]: Christos Faloutsos, Ron Barber, Myron Flickner, Jim Hafner, Wayne Niblack, Dragutin Petkovic, William Equitz: Efficient and Effective Querying by Image Content. J. Intell. Inf. Syst. 3(3/4): 231-262(1994)
[16]: ...
[17]: Myron Flickner, Harpreet S. Sawhney, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, Peter Yanker: Query by Image and Video Content: The QBIC System. IEEE Computer 28(9): 23-32(1995)
[18]: Jerome H. Friedman, Jon Louis Bentley, Raphael A. Finkel: An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. Math. Softw. 3(3): 209-226(1977)
[19]: ...
[20]: ...
[21]: Trevor Hastie, Robert Tibshirani: Discriminant Adaptive Nearest Neighbor Classification. KDD 1995: 142-149
[22]: ...
[23]: ...
[24]: Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998: 604-613
[25]: Kothuri Venkata Ravi Kanth, Divyakant Agrawal, Ambuj K. Singh: Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD Conference 1998: 166-176
[26]: ...
[27]: ...
[28]: Norio Katayama, Shin'ichi Satoh: The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. SIGMOD Conference 1997: 369-380
[29]: Nathan Linial, Eran London, Yuri Rabinovich: The geometry of graphs and some of its algorithmic applications. FOCS 1994: 577-591
[30]: ...
[31]: ...
[32]: ...
[33]: Gurmeet Singh Manku, Sridhar Rajagopalan, Bruce G. Lindsay: Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998: 426-435
[34]: Yossi Matias, Jeffrey Scott Vitter, Min Wang: Wavelet-Based Histograms for Selectivity Estimation. SIGMOD Conference 1998: 448-459
[35]: Rajeev Motwani, Prabhakar Raghavan: Randomized Algorithms. Cambridge University Press 1995, ISBN 0-521-47465-5
[36]: ...
[37]: Alex Pentland, Rosalind W. Picard, Stan Sclaroff: Photobook: Tools for Content-Based Manipulation of Image Databases. Storage and Retrieval for Image and Video Databases (SPIE) 1994: 34-47
[38]: Gerard Salton, Michael McGill: Introduction to Modern Information Retrieval. McGraw-Hill Book Company 1984, ISBN 0-07-054484-0
[39]: Hanan Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990
[40]: ...
[41]: Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: Multidimensional Access Methods: Trees Have Grown Everywhere. VLDB 1997: 13-14
[42]: ...
[43]: John R. Smith, Shih-Fu Chang: Visually Searching the Web for Content. IEEE MultiMedia 4(3): 12-20(1997)
[44]: ...
[45]: Roger Weber, Hans-Jörg Schek, Stephen Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205