Semantic Compression and Pattern Extraction with Fascicles.

H. V. Jagadish, J. Madar, Raymond T. Ng: Semantic Compression and Pattern Extraction with Fascicles. VLDB 1999: 186-198

@inproceedings{DBLP:conf/vldb/JagadishMN99,
  author    = {H. V. Jagadish and
               J. Madar and
               Raymond T. Ng},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Semantic Compression and Pattern Extraction with Fascicles},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {186-198},
  ee        = {db/conf/vldb/JagadishMN99.html},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Often many recoords in a database share similar values for several attributes. If one is able to identify and group together records that share similar values for some - even if not all - attributes, one can both obtain a more parsimonious representation of the data, and gain useful insight into the data from a mining perspective.

In this paper, we introduce the notion of fascicles. A fascicle F(k,t) is a subset of records that have k compact attributes. An attribute A of a collection F of records is compact if the width of the range of A-values (for numeric attributes) or the number of distinct A-values (for categorial attributes) of all the records in F does not exceed t. We introduce and study two problems related to fascicles. First, we consider how to find fascicles such that the total storage of the relation is minimized. Second, we study how best to extract fascicles whose sizes exceed a given minimum threshold (i.e., support) and that represent patterns of maximal quality, where quality is measured by the pair (k,t). We develop algorithms to attack both of the above problems. We show that these two problems are very hard to solve optimally. But we demonstrate empirically that good solutions can be obtained using our algorithms.

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Online Paper

Download PDF file (www.vldb.org, Darmstadt, Germany)
Download PDF file (www.acm.org, New York, USA)

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7
Contents

References

[1]: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998: 94-105
[2]: Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216
[3]: Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
[4]: Daniel Barbará, William DuMouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond T. Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik: The New Jersey Data Reduction Report. IEEE Data Eng. Bull. 20(4): 3-45(1997)
[5]: Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur: Dynamic Itemset Counting and Implication Rules for Market Basket Data. SIGMOD Conference 1997: 255-264
[6]: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231
[7]: Christos Faloutsos, King-Ip Lin: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995: 163-174
[8]: ...
[9]: ...
[10]: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD Conference 1998: 73-84
[11]: Theodore Johnson, Tamraparni Dasu: Comparing Massive High-Dimensional Data Sets. KDD 1998: 229-233
[12]: ...
[13]: Mika Klemettinen, Heikki Mannila, Pirjo Ronkainen, Hannu Toivonen, A. Inkeri Verkamo: Finding Interesting Rules from Large Sets of Discovered Association Rules. CIKM 1994: 401-407
[14]: Flip Korn, H. V. Jagadish, Christos Faloutsos: Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. SIGMOD Conference 1997: 289-300
[15]: ...
[16]: Heikki Mannila, Hannu Toivonen: Levelwise Search and Borders of Theories in Knowledge Discovery. Data Min. Knowl. Discov. 1(3): 241-258(1997)
[17]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155
[18]: Wee Keong Ng, Chinya V. Ravishankar: Block-Oriented Compression Techniques for Large Statistical Databases. IEEE Trans. Knowl. Data Eng. 9(2): 314-328(1997)
[19]: Frank Olken, Doron Rotem: Rearranging Data to Maximize the Efficiency of Compression. PODS 1986: 78-90
[20]: Michael B. Smyth: Power Domains. J. Comput. Syst. Sci. 16(1): 23-36(1978)
[21]: Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114