VLDB 2021: Tutorials

All times are given in the Copenhagen local timezone (CEST).

17 Aug

09:00 - 10:30 CEST Tutorial 1

Machine Learning for Databases [Download Paper] Guoliang Li (Tsinghua University), Xuanhe Zhou (Tsinghua University), and Lei Cao (MIT)

Download Slides

Machine learning techniques have been proposed to optimize databases. Traditional empirical database optimization techniques (e.g., cost estimation, join order selection, knob tuning, index and view advisors) cannot meet the high-performance requirements of large-scale database instances, diverse applications, and diversified users, especially on the cloud. Fortunately, learning-based techniques can alleviate this problem. In this tutorial, we review existing studies on machine learning for databases: learning-based database configuration (e.g., knob tuning), learning-based database optimization (e.g., cost/cardinality estimation), learning-based database design (e.g., learned indexes), learning-based database monitoring and diagnosis (e.g., slow SQL query diagnosis), and learning-based database security (e.g., sensitive data discovery). We also discuss open research challenges in machine learning for databases.

13:00 - 15:00 CEST Tutorial 2

On the Limits of Machine Knowledge: Completeness, Recall and Negation in Web-scale Knowledge Bases [Download Paper] Simon Razniewski (Max-Planck-Institut für Informatik), Hiba Arnaout (Max-Planck-Institut für Informatik), Shrestha Ghosh (Max-Planck-Institut für Informatik), and Fabian Suchanek (Télécom ParisTech)

Download Slides

General-purpose knowledge bases (KBs) are an important component of several data-driven applications. Pragmatically constructed from available web sources, these KBs are far from complete, which poses a set of challenges in curation as well as consumption. In this tutorial we discuss how completeness, recall, and negation in DBs and KBs can be represented, extracted, and inferred. We proceed in five parts: (i) we introduce the logical foundations of knowledge representation and querying under partial closed-world semantics; (ii) we show how information about recall can be identified in KBs and in text, and (iii) how it can be estimated via statistical patterns; (iv) we show how interesting negative statements can be identified, and (v) how recall can be targeted as a comparative notion.

18 Aug

09:00 - 10:30 CEST Tutorial 3

Array DBMS: Past, Present, and (Near) Future [Download Paper] Ramon Antonio Rodriges Zalipynis (HSE University)

Download Slides

Array DBMSs strive to be the best systems for managing, processing, and even visualizing big N-d arrays. The last decade blossomed with R&D in array DBMSs, making it a young and fast-evolving area. We present the first comprehensive tutorial on array DBMS R&D. We start with past impactful results that are still relevant today, then cover contemporary array DBMSs, array-oriented systems, and state-of-the-art research in array management, flavored with numerous promising R&D opportunities for future work. A great deal of our tutorial covers material not addressed in any previous tutorial or survey article. Advanced array management research is just emerging, and many R&D opportunities still “lie on the surface”. Hence, the conditions for starting to contribute to this research area have never been more favorable. This tutorial will jump-start such efforts.

16:15 - 17:45 CEST Tutorial 4

Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems [Download Paper] Laurel Orr (Stanford University), Atindriyo Sanyal (Uber AI), Xiao Ling (Apple), Karan Goel (Stanford University), and Megan Leszczynski (Stanford University)

Download Slides

The industrial machine learning pipeline requires iterating on model features, training and deploying models, and monitoring deployed models at scale. Feature stores were developed to manage and standardize the engineer's workflow in this end-to-end pipeline, focusing on traditional tabular feature data. In recent years, however, model development has shifted towards using self-supervised pretrained embeddings as model features. Managing these embeddings and the downstream systems that use them introduces new challenges with respect to managing embedding training data, measuring embedding quality, and monitoring downstream models that use embeddings. These challenges are largely unaddressed in standard feature stores. Our goal in this tutorial is to introduce the feature store system and discuss the challenges and current solutions to managing these new embedding-centric pipelines.

19 Aug

09:00 - 10:30 CEST Tutorial 5

New Trends in High-D Vector Similarity Search: AI-driven, Progressive, and Distributed [Download Paper] Karima Echihabi (Mohammed VI Polytechnic University), Themis Palpanas (University of Paris), and Kostas Zoumpatianos (Snowflake Computing)

Download Slides

Similarity search is a core operation of many critical data science applications, involving massive collections of high-dimensional (high-d) objects. Similarity search finds objects in a collection close to a given query according to some definition of sameness. Objects can be data series, text, multimedia, graphs, database tables or deep network embeddings. In this tutorial, we revisit the similarity search problem in light of the recent advances in the field and the new big data landscape. We discuss key data science applications that require efficient high-d similarity search, we survey recent approaches and share surprising insights about their strengths and weaknesses, and we discuss open research problems, including the directions of AI-driven, progressive, and distributed high-d similarity search.
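To make the core operation concrete, here is a minimal brute-force k-nearest-neighbor sketch over a toy collection of high-d vectors; the collection size, dimensionality, Euclidean distance, and k are illustrative assumptions rather than details from the tutorial, and practical systems replace the linear scan with indexing, approximation, or distribution to scale.

```python
# Minimal brute-force k-nearest-neighbor search over high-d vectors
# (illustrative sketch only; real systems rely on indexes, approximate
# methods, or distributed execution to handle massive collections).
import numpy as np

def knn_search(collection: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k vectors in `collection` closest to `query`
    under Euclidean distance."""
    dists = np.linalg.norm(collection - query, axis=1)  # distance to every object
    return np.argsort(dists)[:k]                        # indices of the k nearest

# Toy example: 10,000 objects of 128 dimensions (e.g., deep network embeddings).
rng = np.random.default_rng(0)
collection = rng.standard_normal((10_000, 128))
query = rng.standard_normal(128)
print(knn_search(collection, query, k=5))
```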

11:00 - 12:30 CEST Tutorial 6

Machine Learning for Cloud Data Systems: the Promise, the Progress, and the Path Forward [Download Paper] Alekh Jindal (Microsoft), and Matteo Interlandi (Microsoft)

The goal of this tutorial is to educate the audience about the state of the art in ML for cloud data systems, both in research and in practice. The tutorial is divided into three parts: the promise, the progress, and the path forward.
Part I of the tutorial focuses on the early promise of applying ML for systems, with researchers identifying opportunities across the cloud data system stack. The goal here is to cover the breadth of the topic and give the audience an overview of the enormous potential that has been discovered in recent years.
Part II covers the recent successes in deploying machine learning solutions for cloud data systems. We will discuss the practical considerations taken into account and the progress made at various levels. The goal is to compare and contrast the promise of ML for systems with the ground actually covered in industry.
Finally, Part III discusses practical issues of machine learning in the enterprise, covering the generation of explanations, model debugging, model deployment, model management, constraints on eyes-on data usage and anonymization, and the technical debt that can accrue from machine learning and models in the enterprise.

13:30 - 15:00 CEST Tutorial 7

Data Augmentation for ML-driven Data Preparation and Integration [Download Paper] Yuliang Li (Megagon Labs), Xiaolan Wang (Megagon Labs), Zhengjie Miao (Duke University), and Wang-Chiew Tan (Facebook AI)

Download Slides

In recent years, we have witnessed the development of novel data augmentation (DA) techniques for creating the additional training data needed by machine learning-based solutions. In this tutorial, we will provide a comprehensive overview of the DA techniques developed by the data management community for data preparation and data integration. In addition to surveying task-specific DA operators that leverage rules, transformations, and external knowledge to create additional training data, we also explore advanced DA techniques such as interpolation, conditional generation, and DA policy learning. Finally, we describe the connection between DA and other machine learning paradigms such as active learning, pre-training, and weakly-supervised learning. We hope that this discussion sheds light on future research directions for a holistic data augmentation framework for high-quality dataset creation.
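As a deliberately generic illustration of interpolation-based DA (not a technique taken from the tutorial paper), the sketch below synthesizes a new training pair as a mixup-style convex combination of two labeled examples; the feature vectors, labels, and the alpha parameter are made-up values.

```python
# Illustrative mixup-style interpolation DA: create a synthetic training
# example as a convex combination of two existing (feature, label) pairs.
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 0.4, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x_new = lam * x1 + (1 - lam) * x2     # interpolated features
    y_new = lam * y1 + (1 - lam) * y2     # interpolated (soft) label
    return x_new, y_new

# Toy example with two labeled feature vectors.
x_a, y_a = np.array([1.0, 0.0, 2.0]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.0, 1.0, 1.0]), np.array([0.0, 1.0])
print(mixup(x_a, y_a, x_b, y_b, rng=np.random.default_rng(0)))
```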

17:00 - 18:30 CEST Tutorial 8

Extending the Lifetime of NVM: Challenges and Opportunities [Download Paper] Saeed Kargar (UCSC), and Faisal Nawab (University of California at Irvine)

Download Slides

Recently, Non-Volatile Memory (NVM) technology has revolutionized the landscape of memory systems. With many advantages, such as non-volatility and near-zero standby power consumption, these byte-addressable memory technologies are taking the place of DRAM. Nonetheless, they also present some limitations, such as limited write endurance, which hinders their widespread use in today's systems. Furthermore, adjusting current data management systems to embrace these new memory technologies and all their potential is proving to be a nontrivial task. Because of this, a substantial amount of research has been done, in both the database and the storage systems communities, that tries to improve various aspects of NVMs so as to integrate these technologies into the memory hierarchy. In this tutorial we survey state-of-the-art work from the database and storage systems communities on deploying NVMs and on how their limitations are being handled. In particular, we focus on the challenges related to low write endurance and on extending the lifetime of NVM devices.