VLDB 2021: Tutorials
All times are given in the Copenhagen local time zone at the time of the conference (CEST).
On the Limits of Machine Knowledge: Completeness, Recall and Negation in Web-scale Knowledge Bases [Download Paper]
Simon Razniewski (Max-Planck-Institut für Informatik), Hiba Arnaout (Max-Planck-Institut für Informatik), Shrestha Ghosh (Max-Planck-Institut für Informatik), and Fabian Suchanek (Télécom ParisTech)
General-purpose knowledge bases (KBs) are an important component of several data-driven applications. Pragmatically constructed from available web sources, these KBs are far from complete, which poses a set of challenges in curation as well as consumption. In this tutorial we discuss how completeness, recall and negation in DBs and KBs can be represented, extracted, and inferred. We proceed in five parts: (i) we introduce the logical foundations of knowledge representation and querying under partial closed-world semantics; (ii) we show how information about recall can be identified in KBs and in text, and (iii) how it can be estimated via statistical patterns; (iv) we show how interesting negative statements can be identified, and (v) how recall can be targeted in a comparative notion.
Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems [Download Paper]
Laurel Orr (Stanford University), Atindriyo Sanyal (Uber AI), Xiao Ling (Apple), Karan Goel (Stanford University), and Megan Leszczynski (Stanford University)
The industrial machine learning pipeline requires iterating on model features, training and deploying models, and monitoring deployed models at scale. Feature stores were developed to manage and standardize the engineer's workflow in this end-to-end pipeline, focusing on traditional tabular feature data. In recent years, however, model development has shifted towards using self-supervised pretrained embeddings as model features. Managing these embeddings and the downstream systems that use them introduces new challenges with respect to managing embedding training data, measuring embedding quality, and monitoring downstream models that use embeddings. These challenges are largely unaddressed in standard feature stores. Our goal in this tutorial is to introduce the feature store system and discuss the challenges and current solutions to managing these new embedding-centric pipelines.
New Trends in High-D Vector Similarity Search: AI-driven, Progressive, and Distributed [Download Paper]
Karima Echihabi (Mohammed VI Polytechnic University), Themis Palpanas (University of Paris), and Kostas Zoumpatianos (Snowflake Computing)
Similarity search is a core operation of many critical data science applications, involving massive collections of high-dimensional (high-d) objects. Similarity search finds objects in a collection close to a given query according to some definition of sameness. Objects can be data series, text, multimedia, graphs, database tables or deep network embeddings. In this tutorial, we revisit the similarity search problem in light of the recent advances in the field and the new big data landscape. We discuss key data science applications that require efficient high-d similarity search, we survey recent approaches and share surprising insights about their strengths and weaknesses, and we discuss open research problems, including the directions of AI-driven, progressive, and distributed high-d similarity search.
Machine Learning for Cloud Data Systems: the Promise, the Progress, and the Path Forward [Download Paper]
Alekh Jindal (Microsoft) and Matteo Interlandi (Microsoft)
The goal of this tutorial is to educate the audience about the state of the art in ML for cloud data systems, both in research and in practice. The tutorial is divided into three parts: the promise, the progress, and the path forward.
Part I of the tutorial focuses on the early promise of applying ML for systems, with researchers identifying opportunities across the cloud data system stack. The goal here is to cover the breadth of the topic and give the audience an overview of the enormous potential that has been discovered in recent years.
Part II covers the recent successes in deploying machine learning solutions for cloud data systems. We will discuss the practical considerations taken into account and the progress made at various levels. The goal is to compare and contrast the promise of ML for systems with the ground actually covered in industry.
Finally, Part III discusses practical issues of machine learning in the enterprise, covering the generation of explanations, model debugging, model deployment, model management, constraints on eyes-on data usage and anonymization, and the technical debt that can accrue through machine learning and models in the enterprise.
Data Augmentation for ML-driven Data Preparation and Integration [Download Paper]
Yuliang Li (Megagon Labs), Xiaolan Wang (Megagon Labs), Zhengjie Miao (Duke University), and Wang-Chiew Tan (Facebook AI)
In recent years, we have witnessed the development of novel data augmentation (DA) techniques for creating additional training data needed by machine learning-based solutions. In this tutorial, we will provide a comprehensive overview of techniques developed by the data management community for data preparation and data integration. In addition to surveying task-specific DA operators that leverage rules, transformations, and external knowledge for creating additional training data, we also explore advanced DA techniques such as interpolation, conditional generation, and DA policy learning. Finally, we describe the connection between DA and other machine learning paradigms such as active learning, pre-training, and weakly-supervised learning. We hope that this discussion can shed light on future research directions for a holistic data augmentation framework for high-quality dataset creation.