Tutorials

  1. New Frontiers in Business Intelligence
  2. System Co-Design and Data Management for Flash Devices
  3. Exploration of Deep Web Repositories
  4. Crowdsourcing Applications and Platforms: A Data Management Perspective
  5. Graph Data Management Systems for New Application Domains
  6. Information Diffusion In Social Networks: Observing and Influencing Societal Interests

New Frontiers in Business Intelligence
Surajit Chaudhuri and Vivek Narasayya

Abstract. Business intelligence (BI) software is a collection of decision support technologies for the enterprise aimed at enabling knowledge workers such as executives, managers and analysts to make better and faster decisions. When compared to the BI landscape of the mid-90s, we observe that today’s BI technology has progressed well beyond the OLAP servers, parallel DBMSs, and classical ETL of that era. Several new frontiers in BI have emerged, including “Big Data” engines, near real-time BI, predictive analytics, enterprise search and cloud data services. In this tutorial, we will first provide a broad overview of the current BI landscape, identifying important use cases, describing key technologies, and highlighting relationships across these technologies. Next, we drill down into the newly emerging frontiers of BI discussed above. For each area we discuss the state of the art, highlight architectural considerations and present open research problems.

Surajit Chaudhuri is a Research Manager at Microsoft Research, Redmond. He started the AutoAdmin project on self-tuning database systems at Microsoft Research. Surajit has also worked in the area of data cleaning. His research on both physical database design and data cleaning has been incorporated in Microsoft products and services such as Microsoft SQL Server and Bing. Surajit received his Ph.D. from Stanford University and is an ACM Fellow. He was awarded the ACM SIGMOD Contributions Award in 2004, a VLDB 10-year Best Paper Award in 2007 and the ACM SIGMOD Innovations Award in 2011.

Vivek Narasayya is a Principal Researcher at Microsoft Research, Redmond. He is interested broadly in data management, focusing on the areas of self-tuning database systems, query processing, query optimization, and resource management in databases. He received his Ph.D. from the University of Washington, Seattle. He was awarded a VLDB 10-year Best Paper Award in 2007.


System Co-Design and Data Management for Flash Devices
Philippe Bonnet, Luc Bouganim, Ioannis Koltsidas, and Stratis D. Viglas

Abstract. Flash devices are emerging as a replacement for disks. How does this evolution impact the design of data management systems? While flash devices have been available for years, this question is still open. In this tutorial, we share two views on the development of data management systems for flash devices. The first view considers that flash devices introduce so much complexity that it is necessary to reconsider the strictly layered approach between storage system, operating system and data management system. The second view considers that data management systems should recognize the complexity of flash devices and leverage the characteristics of different classes of devices for different usage patterns. Throughout the tutorial, we will cover the data management stack: from the fundamentals of flash technology, through storage for database systems and the manipulation of flash-resident data, to query processing.

Philippe Bonnet is an associate professor at the IT University of Copenhagen; his research interests lie in the area of data management and system performance. Philippe currently serves as chair of the SIGMOD repeatability committee. Philippe is co-author of a book on database tuning together with Dennis Shasha, NYU.

Luc Bouganim is a research director at INRIA Paris-Rocquencourt and vice-head of the SMIS project, which focuses on secured and mobile information systems. Luc has co-authored more than 50 international conference and journal papers. Since 2000, his research has focused on data management on specific hardware architectures, and more precisely on secure chips. Luc received best paper awards at VLDB in 2000 for his work on PicoDBMS, a database system for smart cards.

Ioannis Koltsidas is a Research Staff Member in the Storage Technologies Department of IBM Research – Zurich. He received a Ph.D. in Computer Science from the University of Edinburgh, UK, in 2010, and a B.Sc. degree in Electrical and Computer Engineering from the National Technical University of Athens, Greece, in 2006. His research interests lie in the areas of systems storage and database storage, with an emphasis on buffer management, caching and automatic tiering. His current focus is on high-performance, high-scalability storage systems employing solid-state storage technologies.

Stratis D. Viglas is a Reader at the School of Informatics of the University of Edinburgh. He received a PhD in Computer Science from the University of Wisconsin-Madison in 2003, and BSc and MSc degrees in Informatics from the University of Athens, Greece, in 1996 and 1999. His research interests lie in the areas of query processing and optimization, data storage and indexing, data stream management, and distributed computing.


Exploration of Deep Web Repositories
Nan Zhang and Gautam Das

Abstract. With the proliferation of online repositories (e.g., databases or document corpora) hidden behind proprietary web interfaces, such as keyword-/form-based search and hierarchical/graph-based browsing interfaces, efficient ways of exploring the contents of such hidden repositories are of increasing importance. There are two key challenges: one is the proper understanding of interfaces, and the other is the efficient exploration, e.g., crawling, sampling and analytical processing, of very large repositories. In this tutorial, we focus on the fundamental developments in the field, including web interface understanding, crawling, sampling, and data analytics over web repositories with various types of interfaces and containing structured or unstructured data. Our goal is to encourage the audience to initiate their own research in these exciting areas.

Dr. Nan Zhang is an Assistant Professor of Computer Science at the George Washington University, Washington, DC, USA. Prior to joining GWU, he was an assistant professor of Computer Science and Engineering at the University of Texas at Arlington from 2006 to 2008. He received the B.S. degree from Peking University in 2001 and the Ph.D. degree from Texas A&M University in 2006, both in computer science. His current research interests include databases and information security/privacy. He received the NSF CAREER award in 2008.

Gautam Das is a Full Professor in the Computer Science and Engineering Department of the University of Texas at Arlington. Prior to UTA, Dr. Das held positions at Microsoft Research, Compaq Corporation and the University of Memphis, as well as visiting positions at IBM Research. He graduated with a BTech in computer science from IIT Kanpur, India, and with a PhD in computer science from the University of Wisconsin-Madison. Dr. Das’s research interests span data mining, information retrieval, databases, approximate query processing, applied graph and network algorithms, and computational geometry. His research has resulted in over 100 papers, many of which have appeared in premier conferences and journals such as SIGMOD, VLDB, ICDE, KDD, TODS, and TKDE, and several of which have received best paper awards. Dr. Das has served as the General Chair of ICIT 2009, Program Committee Vice-Chair of ICDM 2011, Program Chair of COMAD 2008, ICDE DBRank 2007, Best Paper Awards Chair of KDD 2006, Best Papers Awards committee of DASFAA 2008, Program Chair of ICIT 2004, as well as on program committees of premier conferences such as SIGMOD, PODS, WWW, ICDE, KDD, and ICML. His research has been supported by grants from federal and state agencies such as the National Science Foundation, Office of Naval Research, Department of Education, and Texas Higher Education Coordinating Board, as well as by industry, including Nokia, Microsoft, Cadence, and Apollo.


Crowdsourcing Applications and Platforms: A Data Management Perspective
AnHai Doan, Michael Franklin, Donald Kossmann, and Tim Kraska

Abstract. Over the past decade, crowdsourcing has emerged as a major problem-solving and data-gathering paradigm on the World-Wide Web, and has attracted much interest in the database community. This tutorial seeks to spark further interest and contribute to the gathering momentum of crowdsourcing in the community. It provides an overview of current crowdsourcing platforms and applications, describes data management approaches to crowdsourcing, and identifies data management-specific research challenges in the area.

AnHai Doan is an Associate Professor of Computer Science at the University of Wisconsin, Madison. He has worked extensively in crowdsourcing, starting with the MOBS project in 2002-2006, and the Cimple/DBLife project from 2006 to date. In these projects he has studied how to crowdsource data integration problems, and the problems of building structured databases from unstructured Web data. Currently he is on leave, working as Chief Scientist of @WalmartLabs, a newly established research and development lab of Walmart.

Michael J. Franklin is a Professor of Computer Science at UC Berkeley, where he is a Director of the newly-opened Algorithms, Machines and People Laboratory (AMPLab). AMPLab is a five-year, industry-supported collaboration working at the intersection of statistical machine learning (Algorithms), cloud computing (Machines), and crowdsourcing (People). AMPLab aims to combine these resources into a new generation of analytics systems for making sense of diverse data at scale. The CrowdDB hybrid crowd/cloud query processing system is an early effort in this direction.

Donald Kossmann is a professor of Computer Science at ETH Zurich (Switzerland). He completed his PhD at the Technical University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the Technical University of Munich, and the University of Heidelberg. He is an ACM Fellow, a member of the board of trustees of the VLDB Endowment, and was the PC chair of the ACM SIGMOD Conference in 2009. He is a co-founder of i-TV-T (1998), XQRL Inc., and 28msec Inc. (2006). His research interests lie in the area of databases and information systems.

Tim Kraska is a PostDoc in the AMPLab at the Computer Science Division of UC Berkeley. Before joining UC Berkeley, he received his PhD from ETH Zurich. Currently, his research focuses on the intersection between data management for analytics, machine learning, and crowdsourcing. As part of the CrowdDB project, he investigates the design of a hybrid crowd/cloud data management system to answer queries that neither database systems nor search engines can adequately answer.


Graph Data Management Systems for New Application Domains
Philippe Cudré-Mauroux and Sameh Elnikety

Abstract. Graph data management has long been a topic of interest for database researchers. The topic gained renewed interest recently, motivated by the rapid emergence of new application domains including social networks and the Web of data. This tutorial characterizes graph data management techniques and categorizes recent graph data management systems. In this context, we focus on the management of very large graphs such as social networks or the Web of data, rather than on the management of many smaller graphs (which frequently appear in bioinformatics and cheminformatics). The first part of this tutorial describes the requirements imposed by new application domains, and provides a classification of recent systems according to their data and computation models. Our classification also highlights the main representations used to store the graph (dense/sparse native graphs, triple storage or relational layouts), and the access patterns and typical queries considered (reachability or neighborhood queries, updates versus reads, transactional requirements and graph consistency models). In the second part of this tutorial, we map the data and computation models to concrete graph management systems, highlighting target application domains, implementation techniques, scalability and workload requirements.

Philippe Cudré-Mauroux is an Associate NSF Professor at the University of Fribourg in Switzerland. Previously, he was a postdoctoral associate working in the database systems group at MIT. He received his Ph.D. from the Swiss Federal Institute of Technology in Lausanne (EPFL), where he won both the Doctorate Award and the EPFL Press Mention in 2007. Before joining the University of Fribourg, he worked on distributed information management systems for HP, IBM T.J. Watson Research, and Microsoft Research Asia. His research interests are in exascale data management and infrastructures for non-relational data. He was the main investigator of the GridVine decentralized RDF storage system and is currently building dipLODocus[RDF], a scalable and efficient back-end to store and analyze very large graphs of Web data in the cloud. He will be PC Chair of the International Semantic Web conference in 2012 in Boston.

Sameh Elnikety is a researcher at Microsoft Research in Redmond, Washington. He received his Ph.D. from the Swiss Federal Institute of Technology (EPFL) in Lausanne, Switzerland, and an M.Sc. from Rice University in Houston, Texas. His research interests include social network systems, graph databases, and large-scale software systems. Sameh is currently building Horton, a distributed system that manages large graphs on commodity servers while providing a declarative query language with multi-version transactional support. Sameh Elnikety is the PC Chair of the Social Network Systems (SNS 2011) workshop, and the keynote speaker at the Graph Data Management (GDM 2011) workshop. Sameh’s work on database replication received the best paper award at EuroSys 2007.


Information Diffusion In Social Networks: Observing and Influencing Societal Interests
Divyakant Agrawal, Ceren Budak, and Amr El Abbadi

Abstract. Social networks provide great opportunities for social connection, learning, political and social change, as well as individual entertainment and enhancement in a wide variety of forms. Because many social interactions currently take place in online networks, social scientists have access to unprecedented amounts of information about social interaction. This wealth of data can allow scientists to study social interactions on a scale and at a level of detail that has never been possible before. In addition to providing a platform for scientists to observe social interactions at large scale, online social networks are also changing the very nature of social interactions. Understanding information diffusion in social networks is a critical research goal. Greater understanding can be achieved through data analysis, the development of reliable models that can predict outcomes of social processes, and ultimately the creation of applications that can shape the outcome of these processes. In this tutorial, we aim to provide an overview of such recent research based on a wide variety of techniques such as optimization algorithms, data mining, and data streams, covering a large number of problems such as influence spread maximization, misinformation limitation, and the study of trends in online social networks.

Divyakant Agrawal is a Professor in the Department of Computer Science at the University of California, Santa Barbara. Prof. Agrawal’s research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems.

Ceren Budak is a PhD Candidate in the Department of Computer Science at the University of California, Santa Barbara. Her research interests lie in the area of information diffusion in social networks.

Amr El Abbadi is a Professor in the Department of Computer Science at the University of California, Santa Barbara. His research interests lie in the area of scalable database and distributed systems.