Tutorials

  1. Data visualization and social data analysis
  2. Efficient Approximate Search on String Collections
  3. Data fusion - Resolving Data Conflicts for Integration
  4. Information Theory For Data Management
  5. Column-oriented Database Systems
  6. Keyword Querying and Ranking in Databases


Data visualization and social data analysis
J. Heer, J. Hellerstein

Jeffrey Heer (Stanford University). Jeffrey Heer is an Assistant Professor of Computer Science at Stanford University, where his research focuses on human-computer interaction, interactive visualization, and social computing. His work has produced novel visualization techniques for exploring data, software tools that simplify visualization creation and customization, and collaborative analysis systems that leverage the insights of multiple analysts. He is the author of the prefuse and flare open-source visualization toolkits, currently in use by the visualization research community and numerous corporations. Over the years, he has also worked at Xerox PARC, IBM Research, Microsoft Research, and Tableau Software. He holds B.S., M.S., and Ph.D. degrees in Computer Science from the University of California, Berkeley.

Joseph M. Hellerstein (University of California, Berkeley). Joseph M. Hellerstein is a Professor of Computer Science at the University of California, Berkeley, whose research focuses on data management and distributed systems. His work has been recognized via awards including an Alfred P. Sloan Research Fellowship, MIT Technology Review's inaugural TR100 list, and two ACM-SIGMOD "Test of Time" awards. He has also held industrial posts including Director of Intel Research Berkeley and Chief Scientist of Cohera Corporation. He serves on the technical advisory boards of a number of companies, including Swivel, a social data visualization website, and Greenplum, a parallel database system vendor.

Abstract. In order to produce value from data, we must make sense of it. Such sensemaking—turning data sets into knowledge—is the basic motivation for query processing and data mining research. Sensemaking is also a fundamental challenge in human-computer interaction. Holistic solutions to sensemaking must integrate large-scale data storage, access, and analysis tools with subjective and contextualized human judgments about the meaning and significance of patterns in the data. Visualization technologies have proven essential for helping people understand data, leveraging the human visual system to analyze large amounts of information. In this tutorial we will examine how visualization technology can be applied to support data analysis and sensemaking. We will begin by surveying techniques and algorithms for creating effective visualizations based on principles from graphic design, perceptual psychology, and cognitive science. We will discuss techniques for integrating visualization with query languages and large-scale data processing. Finally, we will discuss emerging developments in social data analysis. The intended audience is database researchers and practitioners who are interested in understanding visualization theory and techniques, and their integration with data processing. No prior knowledge of visualization is assumed, only familiarity with data modeling and management techniques.
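
As a toy illustration of the kind of encoding decisions the tutorial surveys, the short Python/matplotlib sketch below (our own example with made-up data, not material from the tutorial) maps two quantitative fields to spatial position, the channel that perceptual studies rank as most accurate, and a categorical field to hue:

    # A minimal sketch (not from the tutorial): mapping data fields to
    # visual channels. Sample records are made up: (horsepower, mpg, region).
    import matplotlib.pyplot as plt

    cars = [(130, 18, 'US'), (95, 27, 'Europe'), (88, 32, 'Japan'),
            (150, 15, 'US'), (71, 31, 'Japan'), (115, 24, 'Europe')]
    hues = {'US': 'tab:blue', 'Europe': 'tab:orange', 'Japan': 'tab:green'}

    for hp, mpg, region in cars:
        # Quantitative fields -> x/y position; categorical field -> hue.
        plt.scatter(hp, mpg, color=hues[region])

    plt.xlabel('horsepower')
    plt.ylabel('miles per gallon')
    plt.show()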


Efficient Approximate Search on String Collections
M. Hadjieleftheriou, C. Li

Marios Hadjieleftheriou (AT&T Labs - Research). Marios Hadjieleftheriou is an Inventive Researcher at AT&T Labs - Research. He received his Ph.D. degree in Computer Science from the University of California, Riverside, and his B.S. in Computer Science from the National Technical University of Athens, Greece. He was also a Postdoctoral fellow at Boston University. His research interests include core database management and indexing, data mining, data stream management, and data privacy and security.

Chen Li (University of California, Irvine). Chen Li is an associate professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003, as well as several other NSF grants. He has also served as a part-time Visiting Research Scientist at Google. His research interests are in the fields of data management and information search, including text search, data cleansing, data integration, and data warehousing.

Abstract. This tutorial provides a comprehensive overview of recent research progress on the important problem of approximate search in collections of strings. It aims at identifying existing search algorithms and selectivity-estimation techniques, as well as their merits and limitations. This problem is of great interest for a variety of applications, including data cleaning, query relaxation, and spell checking. The performance of approximate string searching algorithms is critical in these applications in order to support very large dataset sizes and high query throughput. In addition, accurate selectivity estimation of approximate string queries is equally important for query optimization purposes. We will present a succinct summary of existing work that portrays the latent relationships between different approaches to performing approximate string searches, giving a deeper understanding of the state of the art. We will also conduct a comparative study of the merits and pitfalls associated with different algorithms and techniques, which will help identify the right tool for the right problem.
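
To make the flavor of such algorithms concrete, here is a small, self-contained Python sketch of the standard q-gram filter-and-verify approach, one family of techniques studied in this area. It is our own illustration with made-up strings, not code from the tutorial; real systems build inverted indexes on the q-grams instead of scanning the collection:

    # A minimal filter-and-verify sketch for approximate string search.
    # Illustrative only: real systems index q-grams in inverted lists.
    def qgrams(s, q=2):
        s = '#' * (q - 1) + s + '$' * (q - 1)  # pad so string ends are covered
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def edit_distance(a, b):
        # Standard dynamic program, O(|a| * |b|).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # delete
                               cur[j - 1] + 1,             # insert
                               prev[j - 1] + (ca != cb)))  # substitute
            prev = cur
        return prev[-1]

    def search(query, collection, k=1, q=2):
        qg = qgrams(query, q)
        # Each edit destroys at most q padded q-grams, so a true match
        # shares at least len(query) + q - 1 - k*q of them (a set-based
        # approximation of the classical count filter).
        threshold = len(query) + q - 1 - k * q
        return [s for s in collection
                if len(qg & qgrams(s, q)) >= threshold   # cheap filter
                and edit_distance(query, s) <= k]        # exact verification

    print(search('johnson', ['jonson', 'johnsen', 'johansson', 'smith']))
    # -> ['jonson', 'johnsen']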


Data fusion - Resolving Data Conflicts for Integration
X. Dong, F. Naumann

Xin Luna Dong (AT&T Labs - Research). Xin Luna Dong received a Bachelor's degree in Computer Science from Nankai University in China in 1988, and a Master's degree in Computer Science from Peking University in China in 2001. She obtained her Ph.D. in Computer Science and Engineering from the University of Washington in 2007 and joined AT&T Labs - Research after graduation. Dr. Dong's research interests include databases, information retrieval, and machine learning, with an emphasis on data integration, data cleaning, probabilistic data management, schema matching, personal information management, web search, web-service discovery and composition, and the Semantic Web. Dr. Dong led the development of the Semex personal information management system, which won the best demo award (one of the top three demos) at SIGMOD'05.

Felix Naumann (Hasso Plattner Institute). Felix Naumann studied mathematics at the Technical University of Berlin and received his diploma in 1997. As a member of the graduate school "Distributed Information Systems" at Humboldt University of Berlin, he finished his PhD thesis in 2000. His dissertation in the area of data quality received the dissertation prize of the German Informatics Society (GI) for the best dissertation in Germany in 2000. For the following two years Felix Naumann worked at the IBM Almaden Research Center. From 2003 to 2006 he was an assistant professor at Humboldt University of Berlin, heading the Information Integration group. Since 2006 he has been a full professor at the Hasso Plattner Institute, which is affiliated with the University of Potsdam, where he heads the information systems department. His experience in the area of data integration and data fusion is demonstrated by many publications and numerous industrial collaborations in that area. Felix Naumann has served on the program committees of many international conferences and as a reviewer for many journals. He is an associate editor of the ACM Journal of Data and Information Quality and will be the general chair of the International Conference on Information Quality (ICIQ) in 2009.

Abstract. Modern data management applications often require the integration of data sources and the provision of a uniform interface for users to access data from different sources. Data integration systems face two kinds of challenges. First, data from disparate sources are often heterogeneous, both at the schema level and at the instance level. Second, different sources can provide conflicting data. Conflicts can arise because of incomplete data, erroneous data, and out-of-date data. It is critical for data integration systems to resolve conflicts between sources and to distinguish true values from false ones. This tutorial focuses on data fusion, which addresses the second challenge by fusing records that refer to the same real-world entity into a single record and resolving possible conflicts between data sources. Data fusion plays an important role in data integration systems: it detects and removes dirty data and increases the correctness of the integrated data. This tutorial gathers the models, techniques, and systems of the broad but as yet unconsolidated field of data fusion and presents them in a concise and consolidated manner. We provide an overview of the causes and challenges of data fusion and cover a wide range of techniques, from simple to advanced, for resolving data conflicts in different types of settings and systems.
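
As a toy example at the simplest end of this spectrum (our own sketch with hypothetical sources and weights, not material from the tutorial), the Python fragment below resolves conflicting claims about a single attribute by accuracy-weighted voting; advanced fusion techniques refine this idea by estimating source accuracy and detecting copying between sources:

    # A minimal sketch of conflict resolution by accuracy-weighted voting.
    # Sources, claims, and weights below are hypothetical.
    from collections import defaultdict

    claims = {                # source -> value claimed for one entity's phone
        'src_A': '555-0100',
        'src_B': '555-0100',
        'src_C': '555-0199',  # e.g. an out-of-date listing
    }
    accuracy = {'src_A': 0.9, 'src_B': 0.6, 'src_C': 0.7}  # assumed weights

    votes = defaultdict(float)
    for source, value in claims.items():
        votes[value] += accuracy[source]  # each source votes with its weight

    print(max(votes, key=votes.get))  # '555-0100' (weight 1.5 beats 0.7)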


Information Theory For Data Management
D. Srivastava, S. Venkatasubramanian

Divesh Srivastava (AT&T Labs-Research). Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech. from the Indian Institute of Technology, Bombay.

Suresh Venkatasubramanian (University of Utah). Suresh Venkatasubramanian is the John and Marva Warnock Assistant Professor at the University of Utah. He received his Ph.D. from Stanford University, and his B.Tech. from the Indian Institute of Technology, Kanpur. His research interests lie in algorithms, computational geometry, and large-scale data mining and analysis.

Abstract. We are awash in data. The explosion in computing power and computing infrastructure allows us to generate multitudes of data, in differing formats, at different scales, and in inter-related areas. Data management is fundamentally about the harnessing of this data to extract information, discovering good representations of the information, and analyzing information sources to glean structure. Data management generally presents us with cost-benefit tradeoffs. If we store more information, we get better answers to queries, but we pay the price in terms of increased storage. Conversely, reducing the amount of information we store improves performance at the cost of decreased accuracy for query results. The ability to quantify information gain or loss can only improve our ability to design good representations, storage mechanisms, and analysis tools for data. Information theory provides us with the tools to quantify information in this manner. It was originally designed as a theory of data communication over noisy channels. However, it has more recently been used as an abstract domain-independent technique for representing and analyzing data. For example, entropy measures the degree of disorder in data and mutual information captures the idea of noisy relationships among data. In general, viewing information theory as a tool to express and quantify notions of information content and information transfer has been very successful as a way of extracting structure from data. In this tutorial, we will explore the use of information theory as part of a data representation and analysis toolkit. We will do this with illustrative examples that span a wide range of topics of interest to data management researchers and practitioners. We will also examine the computational challenges associated with information-theoretic primitives, indicating how they might be computed efficiently.
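
To ground the two quantities named above, the following short Python sketch (our own example with made-up data) computes the empirical entropy of a column of a relation and the mutual information between two columns, using the identity I(X;Y) = H(X) + H(Y) - H(X,Y):

    # A minimal sketch: empirical entropy of a column and mutual
    # information between two columns. The relation below is made up.
    from collections import Counter
    from math import log2

    def entropy(column):
        n = len(column)
        return -sum((c / n) * log2(c / n) for c in Counter(column).values())

    def mutual_information(x, y):
        # I(X;Y) = H(X) + H(Y) - H(X,Y)
        return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

    city = ['NYC', 'NYC', 'SF', 'SF', 'LA', 'LA']
    zipc = ['10001', '10001', '94103', '94103', '90001', '90012']

    print(entropy(city))                   # log2(3) ~ 1.585 bits
    # Zip code determines city here, so I(city; zip) equals H(city).
    print(mutual_information(city, zipc))  # ~ 1.585 bits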


Column-oriented Database Systems
D. J. Abadi, P. A. Boncz, S. Harizopoulos

Daniel J. Abadi (Yale). Daniel Abadi serves on the Yale computer science faculty as an Assistant Professor. Before joining Yale, he spent four years at the Massachusetts Institute of Technology where he published numerous papers on column-store databases, made significant contributions to the C-Store project, and wrote his Ph.D. dissertation on "Query Execution in Column-Oriented Database Systems." This Ph.D. thesis won the 2008 SIGMOD Jim Gray Doctoral Dissertation Award. Abadi has also been involved in Vertica, a column-store DBMS start-up, since its founding. In addition to the Jim Gray award, Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, and a VLDB best paper award.

Peter A. Boncz (CWI, Amsterdam). Peter A. Boncz has been a researcher in the database architecture research group (INS1) at CWI since 2002. He obtained his Ph.D. degree at the University of Amsterdam in 2002 with research on architecture-conscious column stores that resulted in the MonetDB system. He also co-founded the DaMoN workshop series, which has brought together architecture-conscious researchers at the last five editions of SIGMOD/PODS. Peter was also a co-founder of Data Distilleries BV, which used MonetDB in commercial data mining technology and was acquired by SPSS in 2002. He recently founded VectorWise, a new column-store start-up.

Stavros Harizopoulos (HP Labs). Stavros is a researcher in the Intelligent Information Management Lab at HP Labs, which is focused on enabling near real-time business intelligence with robust, scalable data management. He received his Ph.D. degree in Computer Science from Carnegie Mellon University in 2005, and until 2007 he worked as a post-doctoral researcher in the database group at MIT, where he contributed to the C-Store project. In addition to column-oriented database systems, Stavros's research interests include energy-efficient data management systems, query processing on new processor and storage technologies, and main-memory transaction processing.

Abstract. Column-oriented database systems (column-stores) have attracted a lot of attention in the past few years. Column-stores, in a nutshell, store each database table column separately, with attribute values belonging to the same column stored contiguously, compressed, and densely packed, as opposed to traditional database systems that store entire records (rows) one after the other. Reading a subset of a table's columns becomes faster, at the potential expense of excessive disk-head seeking from column to column for scattered reads or updates. After several dozen published research papers and at least a dozen new column-store start-ups, several questions remain. Are these a new breed of systems or simply old wine in new bottles? How easily can a major row-based system achieve column-store performance? Are column-stores the answer to effortlessly supporting large-scale data-intensive applications? What are the new, exciting systems research problems to tackle? What new applications could be enabled by column-stores? In this tutorial, we will give an overview of column-oriented database system technology and address these and other related questions.
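
To make the storage-layout contrast concrete, here is a small Python sketch (our own illustration, not from the tutorial) of one relation stored both ways, showing why a query touching only a few columns reads less data under the column layout:

    # A minimal sketch of row vs. column layout for one relation.
    # Real column stores add compression, dense packing, and vectorized scans.

    rows = [                      # row layout: whole records together
        (1, 'widget', 9.99, 'US'),
        (2, 'gadget', 4.50, 'EU'),
        (3, 'gizmo', 12.00, 'US'),
    ]

    columns = {                   # column layout: each attribute contiguous
        'id':     [1, 2, 3],
        'name':   ['widget', 'gadget', 'gizmo'],
        'price':  [9.99, 4.50, 12.00],
        'region': ['US', 'EU', 'US'],
    }

    # SELECT SUM(price) WHERE region = 'US'
    # Row layout: every full record is touched.
    print(sum(r[2] for r in rows if r[3] == 'US'))

    # Column layout: only the two relevant columns are touched.
    print(sum(p for p, g in zip(columns['price'], columns['region'])
              if g == 'US'))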


Keyword Querying and Ranking in Databases
S. Chaudhuri, G. Das

Surajit Chaudhuri (Microsoft Research). Surajit Chaudhuri is a Principal Researcher and a Research Area Manager at Microsoft Research, Redmond. He has worked in the areas of query optimization, physical database design, and data cleaning. He is also interested in the problem of searching and querying information by exploiting IR as well as DBMS techniques. His research on physical database design and data cleaning has been incorporated in the Microsoft SQL Server product. Surajit received his Ph.D. from Stanford University and worked at Hewlett-Packard Laboratories, Palo Alto, from 1991 to 1995. He is an ACM Fellow. He was awarded the ACM SIGMOD Contributions Award in 2004 and the VLDB 10-Year Best Paper Award in 2007.

Gautam Das (University of Texas at Arlington). Gautam Das is an Associate Professor and Head of the Database Exploration Laboratory (DBXLAB) in the CSE department of the University of Texas at Arlington. Prior to joining UTA in Fall 2004, Dr. Das held positions at Microsoft Research, Compaq Corporation, and the University of Memphis. He received his Ph.D. from the University of Wisconsin, Madison. Dr. Das's research interests span data mining, information retrieval, databases, algorithms, and computational geometry. He is currently interested in ranking, top-k query processing, and sampling problems in databases, as well as data management problems in P2P and sensor networks, social networks, blogs, and web communities. Dr. Das's research has been supported by grants from the National Science Foundation, the Office of Naval Research, Microsoft Research, Nokia Research, Cadence Design Systems, and Apollo Data Technologies.

Abstract. With the proliferation of online databases, simple ways of exploring the contents of such databases are of increasing importance. Examples include users wishing to search databases and catalogs of products such as homes, cars, cameras, restaurants, and photographs. A popular paradigm for tackling this problem is to allow users to query such databases in the same way they explore Web documents, in other words, to extend techniques such as keyword querying and automated result ranking to databases. However, the complex structure present in databases makes a direct adaptation of information retrieval techniques difficult. This problem has attracted a lot of research, as it poses a rich set of challenges, from defining the semantics of such a querying model to developing algorithms that ensure adequate performance. In this tutorial, we focus on the fundamental developments in this field.
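
As a toy illustration of this querying model (our own sketch with made-up tuples, not the tutorial's algorithms), the Python fragment below ranks flattened database tuples against a keyword query with a simple IDF-style score; the techniques covered in the tutorial address the much harder case where an answer is a join of tuples across several tables:

    # A minimal sketch: IDF-weighted keyword ranking over single tuples.
    # The "catalog" below is made up; real systems rank joined tuple trees.
    from math import log

    tuples = [
        'canon digital camera black',
        'nikon digital slr camera',
        'leather camera bag black',
        'red sports car convertible',
    ]

    def idf(term, docs):
        df = sum(term in d.split() for d in docs)
        return log(len(docs) / df) if df else 0.0

    def score(query, doc, docs):
        # Rarer terms contribute more to the score.
        return sum(idf(t, docs) for t in query.split() if t in doc.split())

    q = 'digital camera'
    for t in sorted(tuples, key=lambda d: score(q, d, tuples), reverse=True):
        print(round(score(q, t, tuples), 2), t)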