# VLDB 2021: Research Track Papers

This page lists the Research Sessions and the corresponding papers that will be presented at the conference. Note that all the following papers have been published in the Proceedings of the VLDB Endowment (PVLDB). VLDB 2021 includes the papers published in PVLDB Vol. 13, No. 13 (roll-over) and the papers published in PVLDB Vol. 14, Nos. 1-12.

All times are given in the Copenhagen local timezone (CEST).

17 August

09:00 – 10:30 CEST | Research Session 1: Persistent Memory I | Chaired by Peter Boncz

Zen: a High-Throughput Log-Free OLTP Engine for Non-Volatile Main Memory [Download Paper] (Chinese Academy of Sciences), (Chinese Academy of Sciences), (Chinese Academy of Sciences) Emerging Non-Volatile Memory (NVM) technologies like 3D XPoint promise significant performance potential for OLTP databases. However, transactional databases need to be redesigned because the key assumptions that non-volatile storage is orders of magnitude slower than DRAM and only supports block-oriented access no longer hold. NVMs are byte-addressable and almost as fast as DRAM, and NVM capacity is much (4-16x) larger than that of DRAM. Such characteristics make it possible to build an OLTP database entirely in NVM main memory.
This paper studies the structure of OLTP engines with hybrid NVM and DRAM memory. We observe three challenges in designing an OLTP engine for NVM: tuple metadata modifications, NVM write redundancy, and NVM space management. We propose Zen, a high-throughput log-free OLTP engine for NVM. Zen addresses the three design challenges with three novel techniques: a metadata-enhanced tuple cache, log-free persistent transactions, and light-weight NVM space management. Experimental results on a real machine equipped with Intel Optane DC Persistent Memory show that Zen achieves up to a 10.1x improvement over existing solutions when running an OLTP database as large as the NVM capacity, while also achieving fast failure recovery.

Revisiting the Design of LSM-tree Based OLTP Storage Engine with Persistent Memory [Download Paper] (Beihang University), (AZFT), (Beihang University), (AZFT), (AZFT), (AZFT), (Alibaba), (AZFT), (AZFT), (Alibaba) The recent byte-addressable and large-capacity commercialized persistent memory (PM) promises to drive database as a service (DBaaS) into uncharted territory. This paper investigates how to leverage PM to revisit the conventional LSM-tree based OLTP storage engines designed for the DRAM-SSD hierarchy for DBaaS instances. Specifically, we (1) propose a light-weight PM allocator named Halloc customized for LSM-trees, (2) build a high-performance Semi-persistent Memtable utilizing the persistent in-memory writes of PM, (3) design a concurrent commit algorithm named Reorder Ring to achieve log-free transaction processing for OLTP workloads, and (4) present a Global Index as the new globally sorted persistent level with non-blocking in-memory compaction. The design of Reorder Ring and the Semi-persistent Memtable achieves fast writes without synchronized logging overheads and near-instant recovery time. Moreover, the design of the Semi-persistent Memtable and the Global Index with in-memory compaction enables byte-addressable persistent levels in PM, which significantly reduces the read and write amplification as well as the background compaction overheads. The overall evaluation shows that the performance of our proposal over the PM-SSD hierarchy outperforms the baseline by up to 3.8x in the YCSB benchmark and by 2x in the TPC-C benchmark.

SaS: SSD as SQL Database System [Download Paper] (Sungkyunkwan University), (Sungkyunkwan University), (Sungkyunkwan University), (Sungkyunkwan University) Every database engine runs on top of an operating system in the host, strictly separated from the storage. This more-than-half-century-old IHDE (In-Host-Database-Engine) architecture, however, reveals its limitations when run on fast flash memory SSDs. In particular, the IO stacks incur significant run-time overhead and also hinder vertical optimizations between database engines and SSDs. In this paper, we envisage a new database architecture, called SaS (SSD as SQL database engine), where a full-blown SQL database engine runs inside the SSD, tightly integrated with the SSD architecture without intervening kernel stacks. As the IO stacks are removed, SaS is free from their run-time overhead and can further explore numerous vertical optimizations between the database engine and the SSD. SaS evolves the SSD from a dummy block device into a database server with SQL as its primary interface. The benefit of SaS will be even more pronounced in data centers, where the distance between the database engine and the storage is ever widening because of virtualization, storage disaggregation, and open software stacks. The advent of computational SSDs with more compute resources will make SaS an even more viable and attractive database architecture.

09:00 – 10:30 CEST | Research Session 2: Data Mining | Chaired by Martin Hentschel

On the Efficiency of K-Means Clustering: Evaluation, Optimization, and Algorithm Selection [Download Paper] (New York University), (Monash University), (RMIT University) This paper presents a thorough evaluation of the existing methods that accelerate Lloyd’s algorithm for fast k-means clustering. To do so, we analyze the pruning mechanisms of existing methods, and summarize their common pipeline into a unified evaluation framework UniK. Our UniK embraces a class of well-known methods and enables a fine-grained performance breakdown of existing methods. Within UniK, we thoroughly evaluate the pros and cons of existing methods using multiple performance metrics on a number of datasets. Furthermore, we derive an optimized algorithm over UniK, which effectively hybridizes multiple existing methods for more aggressive pruning. To take this further, we also investigate whether the most efficient method for a given clustering task can be automatically selected by machine learning, to benefit practitioners and researchers.
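For context, the Lloyd's algorithm that UniK's pruning methods accelerate can be sketched in a few lines; the data, the deterministic seeding, and the iteration count below are illustrative choices, not part of the paper:

```python
def lloyd_kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm: repeatedly assign each point to its
    nearest centroid, then recompute each centroid as its cluster's mean."""
    centroids = list(points[:k])  # simple deterministic seeding for the demo
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = sorted(lloyd_kmeans(pts, 2))
print(cents)  # one centroid near (0.33, 0.33), one near (10.33, 10.33)
```

The accelerators surveyed in the paper keep exactly these semantics while pruning most of the point-to-centroid distance computations in the assignment step.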

Comprehensible Counterfactual Explanation on Kolmogorov-Smirnov Test [Download Paper] (Simon Fraser University), (McMaster University), (City University of Hong Kong), (Simon Fraser University) The Kolmogorov-Smirnov (KS) test is popularly used in many applications, such as anomaly detection, astronomy, database security and AI systems. One challenge that has remained untouched is how to obtain an explanation of why a test set fails the KS test. In this paper, we tackle the problem of producing counterfactual explanations for test data failing the KS test. Concept-wise, we propose the notion of most comprehensible counterfactual explanations, which accommodates both the KS test data and the user domain knowledge in producing explanations. Computation-wise, we develop an efficient algorithm MOCHE (for MOst CompreHensible Explanation) that avoids enumerating and checking an exponential number of subsets of the test set failing the KS test. MOCHE not only guarantees to produce the most comprehensible counterfactual explanations, but is also orders of magnitude faster than the baselines. Experiment-wise, we present a systematic empirical study on a series of benchmark real datasets to verify the effectiveness, efficiency and scalability of most comprehensible counterfactual explanations and MOCHE.
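To make the setting concrete, the two-sample KS statistic that a test set must keep small can be computed as below; the `ks_statistic` helper and the samples are illustrative, not part of MOCHE:

```python
import bisect

def ks_statistic(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    d = 0.0
    for x in set(s1) | set(s2):
        f1 = bisect.bisect_right(s1, x) / n1
        f2 = bisect.bisect_right(s2, x) / n2
        d = max(d, abs(f1 - f2))
    return d

reference = list(range(1, 11))  # hypothetical "passing" sample
shifted = list(range(11, 16))   # hypothetical "failing" sample
print(ks_statistic(reference, reference))  # identical samples: 0.0
print(ks_statistic(reference, shifted))    # disjoint ranges: 1.0
```

A counterfactual explanation in the paper's sense is a subset of the failing test set whose removal brings this statistic back under the test's threshold.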

GeCo: Quality Counterfactual Explanations in Real Time [Download Paper] (University of Washington), (University of Washington), (University of Washington), (University of Washington) Explanations often take the form of counterfactuals, which convey to the end user what they need to change in order to improve the outcome. Computing counterfactual explanations is challenging because of the inherent tension between a rich semantics of the domain and the need for real-time response. In this paper we present GeCo, the first system that can compute plausible and feasible counterfactual explanations in real time. At its core, GeCo relies on a genetic algorithm, which is customized to favor counterfactual explanations with the smallest number of changes. To achieve real-time performance, we introduce two novel optimizations: $\Delta$-representation of candidate counterfactuals, and partial evaluation of the classifier. We empirically compare GeCo against five other systems described in the literature, and show that it is the only system that can achieve both high-quality explanations and real-time answers.

A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing [Download Paper] (East China Normal University), (East China Normal University), (Hong Kong University of Science and Technology), (University of New South Wales), (Sun Yat-sen University) With the rapid development of smart mobile devices, car-hailing platforms (e.g., Uber or Lyft) have attracted much attention from both academia and industry. In this paper, we consider an important dynamic car-hailing problem, namely maximum revenue vehicle dispatching (MRVD), in which rider requests dynamically arrive and drivers need to serve as many riders as possible such that the entire revenue of the platform is maximized. We prove that the MRVD problem is NP-hard and intractable. In addition, dynamic car-hailing platforms have no information about future riders, which makes the problem even harder. To handle the MRVD problem, we propose a queueing-based vehicle dispatching framework, which first uses existing machine learning algorithms to predict the future vehicle demand of each region, then estimates the idle time periods of drivers through a queueing model for each region. With the predicted vehicle demands and estimated idle time periods of drivers, we propose two batch-based vehicle dispatching algorithms to efficiently assign suitable drivers to riders such that the expected overall revenue of the platform is maximized during each batch processing. Through extensive experiments, we demonstrate the efficiency and effectiveness of our proposed approaches over both real and synthetic datasets. In summary, our methods achieve a 5-10% increase in overall revenue without sacrificing running speed compared with existing solutions.

09:00 – 10:30 CEST | Research Session 3: Sketches | Chaired by Thomas Neumann

On the algebra of data sketches [Download Paper] (Wrocław University of Science and Technology) We consider the problem of designing a distributed data sketch for a scenario in which a data stream is observed by many independent network nodes. We require that a sketch, apart from being computationally and memory efficient, should also be mergeable in a way that mimics set theory operations on the related data sets. For example, when monitoring network traffic, one may consider how many distinct packets passed through a given node (sum of sets for different time windows) or passed through two given nodes (intersection of sets from two locations) and what is their total size (intersection of weighted sets).
In this paper we propose a sketch that can efficiently summarize sets constructed from a sequence of set theory operations. We also provide analytical control over the trade-off between accuracy and storage/computational requirements. In comparison to previous work, the proposed solution 1) supports weighted elements, 2) allows performing set theory operations simultaneously on a large number of sketches, and 3) does not require computationally expensive numerical calculations and guarantees low overheads.

SetSketch: Filling the Gap between MinHash and HyperLogLog [Download Paper] (Dynatrace Research) MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting distinct elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
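A minimal MinHash implementation illustrates the kind of mergeable, idempotent summary discussed here (salted MD5 stands in for the k hash functions; this is a toy for intuition, not SetSketch itself):

```python
import hashlib

def minhash_sig(items, k=64):
    """MinHash signature: for each of k hash functions (salted MD5 here),
    keep the minimum hash value over the set's elements."""
    return [min(int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for i in range(k)]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def merge(sig_a, sig_b):
    """Commutative, idempotent union of two sketches: elementwise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

A, B = set(range(100)), set(range(50, 150))  # true Jaccard = 50/150 ≈ 0.33
sa, sb = minhash_sig(A), minhash_sig(B)
print(round(jaccard_estimate(sa, sb), 2))
```

Note that merging two signatures yields exactly the signature of the union, which is the mergeability property the abstract emphasizes.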

SKT: A One-Pass Multi-Sketch Data Analytics Accelerator [Download Paper] (ETH Zurich), (Accemic Technologies), (ETHZ) Data analysts often need to characterize a data stream as a basic first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter characterizing the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the growing specialization of the underlying compute infrastructure by hardware accelerators, this paper proposes SKT, an FPGA-based bump-in-the-wire accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes streams at TCP/IP line rates of 100 Gbps and is built to fit emerging cloud service architectures, such as Microsoft’s Catapult or Amazon’s AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. It demonstrates by extensive experimentation how the FPGA-accelerated SKT implementation achieves a significant performance gain over high-end, server-class CPUs.

KLL±: Approximate Quantile Sketches over Dynamic Datasets [Download Paper] (UCSB), (University of California, Santa Barbara), (UCSB), (University of California, Santa Barbara), (UC Santa Barbara) Recently, the long-standing problem of optimal construction of quantile sketches was resolved by Karnin, Lang, and Liberty with their KLL sketch (FOCS 2016). The KLL algorithm is only online updatable for insert operations, not for delete operations. For tasks like data partitioning through quantile ranges, it is necessary to support delete operations. When the data set is updated dynamically, i.e., when data elements are inserted and deleted, the quantile sketch should reflect the changes. In this paper, we propose KLL± -- the first quantile approximation algorithm to account for both inserts and deletes in the data stream. KLL± extends the functionality of the KLL sketch to support arbitrary updates with small space overhead. The space bound for KLL± is O(α^1.5/ε · log² log(1/(εδ))), where ε and δ are constants that determine precision and failure probability, and α bounds the number of deletions with respect to insert operations. The experimental evaluation of KLL± highlights that with minimal space overhead, KLL± achieves accuracy in quantile approximation comparable to KLL.
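The core difficulty of supporting deletes can be illustrated conceptually: keep separate summaries of inserts and deletes and report the difference of their ranks. In the toy below, exact sorted lists stand in for compacting sketches, so this shows only the rank-difference idea behind handling deletions, not the KLL± algorithm itself:

```python
import bisect

class InsertDeleteRank:
    """Conceptual illustration only: summarize inserts and deletes
    separately, and estimate rank(x) over the live data set as the
    difference of the two ranks. Exact sorted lists stand in for the
    space-bounded sketches a real KLL-style structure would use."""
    def __init__(self):
        self.ins, self.dels = [], []

    def insert(self, x):
        bisect.insort(self.ins, x)

    def delete(self, x):
        bisect.insort(self.dels, x)

    def rank(self, x):
        """Number of live elements <= x."""
        return bisect.bisect_right(self.ins, x) - bisect.bisect_right(self.dels, x)

s = InsertDeleteRank()
for v in range(10):
    s.insert(v)
for v in (0, 1, 2):
    s.delete(v)
print(s.rank(4))  # live values are 3..9, so rank(4) == 2
```

In a sketching setting both summaries are approximate, which is why the space bound above grows with α, the allowed fraction of deletions.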

On-Off Sketch: A Fast and Accurate Sketch on Persistence [Download Paper] (Peking University), (Peking University), (Xiangtan University), (Peking University), (Xiangtan University), (Huawei), (Peking University) Approximate stream processing has attracted much attention recently. Prior art mostly focuses on characteristics like frequency, cardinality, and quantiles. Persistence, as a new characteristic, is getting increasing attention. Unlike frequency, persistence highlights behaviors where an item appears recurrently in many time windows of a data stream. There are two typical problems around persistence: persistence estimation and finding persistent items. In this paper, we propose the On-Off sketch to address both problems. For persistence estimation, using the characteristic that the persistence of an item is increased periodically, we compress increments when multiple items are mapped to the same counter, which significantly reduces the error. Compared with the Count-Min sketch, 1) in theory, we prove that the error of the On-Off sketch is always smaller; 2) in experiments, the On-Off sketch achieves around 6.17 times smaller error and 2.2 times higher throughput. For finding persistent items, we propose a technique to separate persistent and non-persistent items, further improving the accuracy. We show that the space complexity of our On-Off sketch is much better than the state-of-the-art (PIE), and it reduces the error by up to 4 orders of magnitude and achieves 2.84 times higher throughput compared with prior algorithms in experiments.
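For reference, the Count-Min baseline the On-Off sketch is compared against can be written in a few lines (the width, depth, and item names are arbitrary demo choices; to estimate persistence one would insert each item at most once per time window):

```python
import hashlib

class CountMin:
    """Count-Min sketch: d rows of w counters. An item increments one
    counter per row; a point query takes the minimum over its d counters,
    so estimates never undercount but may overcount on collisions."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _col(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for r in range(self.depth):
            self.rows[r][self._col(item, r)] += 1

    def query(self, item):
        return min(self.rows[r][self._col(item, r)] for r in range(self.depth))

cm = CountMin()
for _ in range(42):            # a persistent item: seen in 42 windows
    cm.add("persistent-item")
cm.add("transient-item")       # an item seen in a single window
print(cm.query("persistent-item"))  # at least the true count of 42
```

The On-Off sketch's improvement comes from exploiting that persistence grows at most once per window, which lets colliding items share an increment instead of each adding error.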

09:00 – 10:30 CEST | Research Session 4: Data Preparation | Chaired by Damianos Chatziantoniou

LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization [Download Paper] (University of California, Irvine), (University of California, Irvine), (UC Irvine), (Telecom SudParis), (University of California, Irvine), (UC Irvine), (University of California, Irvine) This paper explores the data cleaning challenges that arise in using WiFi connectivity data to locate users at semantic indoor locations such as buildings, regions, and rooms. WiFi connectivity data consists of sporadic connections between devices and nearby WiFi access points (APs), each of which may cover a relatively large area within a building. Our system, entitled semantic LOCATion cleanER (LOCATER), postulates semantic localization as a series of data cleaning tasks: first, it treats the problem of determining the AP to which a device is connected between any two of its connection events as a missing value detection and repair problem. It then associates the device with the semantic subregion (e.g., a conference room in the region) by postulating it as a location disambiguation problem. LOCATER uses a bootstrapping semi-supervised learning method for coarse localization and a probabilistic method to achieve finer localization. The paper shows that LOCATER achieves high accuracy at both the coarse and fine levels.

CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties [Download Paper] (University of Pennsylvania), (University of Pennsylvania), (University of Pennsylvania) High-quality labels are expensive to obtain for many machine learning tasks, such as medical image classification tasks. Therefore, probabilistic (weak) labels produced by weak supervision tools are used to seed a process in which influential samples with weak labels are identified and cleaned by several human annotators to improve the model performance. To lower the overall cost and computational overhead of this process, we propose a solution called Chef (CHEap and Fast label cleaning), which consists of the following three components. First, to reduce the cost of human annotators, we use Infl, which prioritizes the most influential training samples for cleaning and provides cleaned labels to save the cost of one human annotator. Second, to accelerate the sample selector phase and the model constructor phase, we use Increm-Infl to incrementally produce influential samples, and DeltaGrad-L to incrementally update the model. Third, we redesign the typical label cleaning pipeline so that human annotators iteratively clean smaller batches of samples rather than one big batch. This yields better overall model performance and enables possible early termination when the expected model performance has been achieved. Extensive experiments show that our approach gives good model prediction performance while achieving significant speed-ups.

Horizon: Scalable Dependency-driven Data Cleaning [Download Paper] (MIT), (Qatar Computing Research Institute, HBKU), (Purdue University), (QCRI), (Purdue University), (MIT) An often-cited figure reports that data scientists spend at least 80% of their time preparing and cleaning data. As a result, the need for effective and scalable data cleaning systems has never been more pressing. A large class of data repair algorithms rely on integrity constraints to detect and repair errors. A well-studied class of such constraints is Functional Dependencies (FDs, for short). Although there has been an increased interest in developing general data cleaning systems for a myriad of data errors, scalability has been left behind. This is because current systems assume data cleaning is done offline and in one iteration. However, developing data science pipelines is highly iterative and requires efficient cleaning techniques to scale to millions of records in seconds/minutes, not days. In our efforts to re-think the data cleaning stack and bring it to the era of data science, we introduce Horizon, an end-to-end FD repair system to address two key challenges: (1) Accuracy: most existing FD repair techniques aim to produce repairs that minimize changes to the data. However, this may lead to incorrect combinations of attribute values (or patterns). Horizon smartly leverages the interaction between data patterns induced by various FDs and subsequently selects repairs that preserve the most frequent patterns found in the original data, leading to a better repair accuracy. (2) Scalability: existing data cleaning systems struggle when dealing with large-scale real-world datasets. Horizon features a linear-time repair algorithm that can scale to millions of records and is orders of magnitude faster than state-of-the-art cleaning algorithms. 
We benchmark Horizon and state-of-the-art cleaning systems on multiple datasets and metrics and show that Horizon consistently outperforms existing techniques both in terms of repair quality and scalability.
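The FD violation detection underlying such repair systems can be sketched as follows; the `fd_violations` helper and the zip-to-city example are illustrative, not Horizon's implementation:

```python
def fd_violations(rows, lhs, rhs):
    """Find violations of the functional dependency lhs -> rhs:
    group rows by their lhs values and flag any group that maps
    to more than one distinct rhs value."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups.setdefault(key, set()).add(tuple(row[a] for a in rhs))
    return {k: v for k, v in groups.items() if len(v) > 1}

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},      # violates zip -> city
    {"zip": "60601", "city": "Chicago"},
]
print(fd_violations(rows, ["zip"], ["city"]))
```

A repair algorithm then chooses which value in each conflicting group to change; Horizon's contribution is selecting repairs that preserve the most frequent value patterns rather than simply minimizing the number of changed cells.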

09:00 – 10:30 CEST | Research Session 5: Graph Analysis | Chaired by Philippe Cudre-Mauroux

On Analyzing Graphs with Motif-Paths [Download Paper] (The University of Hong Kong), (The University of Hong Kong, China), (University of Illinois at Urbana-Champaign), (The University of Hong Kong), (The University of Hong Kong), (University of Illinois at Urbana-Champaign) Path-based solutions have been shown to be useful for various graph analysis tasks, such as link prediction and graph clustering. However, they are no longer adequate for handling complex and gigantic graphs. Recently, motif-based analysis has attracted a lot of attention. A motif, or a small graph with a few nodes, is often considered as a fundamental unit of a graph. Motif-based analysis captures high-order structure between nodes, and performs better than traditional "edge-based" solutions. In this paper, we study motif-paths, which are conceptually concatenations of one or more motif instances. We examine how motif-paths can be used in three path-based mining tasks, namely link prediction, local graph clustering and node ranking. We further address the situation when two graph nodes are not connected through a motif-path, and develop a novel defragmentation method to handle it. Experimental results on real graph datasets demonstrate that the use of motif-paths and defragmentation techniques improves graph analysis effectiveness.

GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra [Download Paper] (ETH Zurich), (ETH Zurich), (ETH Zurich), (ETH Zurich), (ETH Zurich), (ETH Zurich), (VSB), (AGH-UST), (ETH), (ETHZ), (ETH), (ETH Zürich), (ETH Zurich), (ETH Zurich), (ETH Zurich), (AGH-UST), (ETH Zurich), (ETH Zurich) We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by >2x), maximal clique listing (by >9x), k-clique listing (by up to 1.1x), and subgraph isomorphism (by 2.5x), also obtaining better theoretical performance bounds.
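As a taste of the set-algebra decomposition GMS advocates, triangle counting reduces entirely to neighbor-set intersections (a toy sketch, not GMS code):

```python
def triangle_count(adj):
    """Count triangles via the set-algebra building block GMS promotes:
    for each edge (u, v) with u < v, intersect the endpoints' neighbor
    sets and keep common neighbors w > v so each triangle counts once."""
    total = 0
    for u, neigh_u in adj.items():
        for v in neigh_u:
            if u < v:
                total += sum(1 for w in neigh_u & adj[v] if w > v)
    return total

k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}  # complete graph
path = {0: {1}, 1: {0, 2}, 2: {1}}                              # no triangles
print(triangle_count(k4), triangle_count(path))  # 4 0
```

Expressing the kernel this way is what lets GMS swap in different set representations (sorted arrays, bitmaps, compressed sets) and parallelize the outer loops without touching the algorithm's logic.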

A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search [Download Paper] (Hangzhou Dianzi University), (Hangzhou Dianzi University), (Hangzhou Dianzi University), (Hangzhou Dianzi University) Approximate nearest neighbor search (ANNS) constitutes an important operation in a multitude of applications, including recommendation systems, information retrieval, and pattern recognition. In the past decade, graph-based ANNS algorithms have been the leading paradigm in this domain, with dozens of graph-based ANNS algorithms proposed. Such algorithms aim to provide effective, efficient solutions for retrieving the nearest neighbors for a given query. Nevertheless, these efforts focus on developing and optimizing algorithms with different approaches, so there is a real need for a comprehensive survey about the approaches' relative performance, strengths, and pitfalls. Thus here we provide a thorough comparative analysis and experimental evaluation of 13 representative graph-based ANNS algorithms via a new taxonomy and fine-grained pipeline. We compared each algorithm in a uniform test environment on eight real-world datasets and 12 synthetic datasets with varying sizes and characteristics. Our study yields novel discoveries, offering several useful principles for improving these algorithms, based on which we design an optimized method that outperforms the state-of-the-art algorithms. This effort also helped us pinpoint algorithms' working portions, along with rule-of-thumb recommendations about promising research directions and suitable algorithms for practitioners in different fields.
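The greedy routing step shared by most graph-based ANNS algorithms can be sketched as below; the one-dimensional toy graph is illustrative, and real systems layer beam search and backtracking on top to escape local minima:

```python
def greedy_search(graph, coords, entry, query):
    """Greedy routing on a proximity graph: from the entry point, hop
    to whichever neighbor is closest to the query; stop at a node whose
    neighbors are all farther away (a local minimum)."""
    def dist(node):
        return sum((a - b) ** 2 for a, b in zip(coords[node], query))
    current = entry
    while True:
        nearest = min(graph[current], key=dist, default=current)
        if dist(nearest) < dist(current):
            current = nearest
        else:
            return current

coords = {i: (float(i),) for i in range(5)}  # five points on a line
graph = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}
print(greedy_search(graph, coords, 0, (3.2,)))  # walks 0 -> 1 -> 2 -> 3
```

The surveyed algorithms differ mainly in how they build the graph (which edges to keep) and how they relax this greedy walk, which is exactly the design space the paper's taxonomy organizes.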

Accelerating Exact Constrained Shortest Paths on GPUs [Download Paper] (National University of Singapore), (National University of Singapore), (Singapore Management University), (Tianjin University) Recently emerging applications such as software-defined networks and autonomous vehicles require efficient and exact solutions to the constrained shortest path (CSP) problem, which finds the shortest path in a graph while satisfying user-defined constraints. Compared with common shortest path problems without constraints, CSP queries have a significantly larger number of subproblems, and the most widely used labeling algorithm becomes prohibitively slow and impractical. Other existing approaches tend to find approximate solutions and build costly indices on graphs for fast query processing, which are not suitable for emerging applications that require exact solutions. A natural question is whether and how we can efficiently find the exact solution for CSP.
In this paper, we propose Vine, a framework that parallelizes the labeling algorithm to efficiently find the exact CSP solution using GPUs. The major challenge addressed in Vine is how to deal with a large number of subproblems that are mostly unpromising but require a significant amount of memory and computational resources. Our solution is twofold. First, we develop a two-level pruning approach to eliminate the subproblems by making good use of the GPU's hierarchical memory. Second, we propose an adaptive parallelism control model based on the observations that the degree of parallelism (DOP) is the key to performance optimization with the given amount of computational resources. Extensive experiments show that Vine achieves 18x speedup on average over the widely adopted CPU-based solution running on 40 CPU threads. Vine also has over 5x speedup compared with a GPU approach that statically controls the DOP. Compared to the state-of-the-art approximate solution with preprocessed indices, Vine provides exact results with competitive or even better performance.
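A sequential sketch of the labeling algorithm that Vine parallelizes, with a simple dominance check standing in for Vine's GPU-side pruning (the function name and graph encoding are hypothetical):

```python
import heapq

def constrained_shortest_path(graph, src, dst, budget):
    """Label-setting search for the single-constraint CSP: each label is
    (cost, resource, node). Labels are expanded in cost order, a label is
    pruned if a cheaper label at the same node used no more resource
    (dominance), and edges that would exceed the budget are skipped."""
    # graph: node -> list of (neighbor, edge_cost, edge_resource)
    heap = [(0, 0, src)]
    settled = {}  # node -> list of non-dominated (cost, resource) labels
    while heap:
        cost, res, u = heapq.heappop(heap)
        if u == dst:
            return cost  # first feasible label at dst is optimal
        if any(c <= cost and r <= res for c, r in settled.get(u, [])):
            continue  # dominated: a better label was already expanded
        settled.setdefault(u, []).append((cost, res))
        for v, c, r in graph.get(u, []):
            if res + r <= budget:
                heapq.heappush(heap, (cost + c, res + r, v))
    return None  # no path satisfies the resource budget

graph = {"A": [("C", 1, 5), ("B", 1, 1)], "B": [("C", 1, 1)]}
print(constrained_shortest_path(graph, "A", "C", 10))  # direct edge: cost 1
print(constrained_shortest_path(graph, "A", "C", 3))   # detour via B: cost 2
```

The challenge Vine addresses is that the label frontier explodes on real graphs; its two-level pruning and adaptive parallelism control decide which of these labels are worth materializing in GPU memory.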

Fast Algorithm for Anchor Graph Hashing [Download Paper] (NTT Communication Science Laboratories), (NTT Software Innovation Center), (NTT Software Innovation Center), (NTT Software Innovation Center), (NTT Communication Science Labs.) Anchor graph hashing is used in many applications such as cancer detection, web page classification, and drug discovery. It computes the hash codes from the eigenvectors of the matrix representing the similarities between data points and anchor points; anchors refer to the points representing the data distribution. In performing an approximate nearest neighbor search, the hash codes of a query data point are determined by identifying its closest anchor points. Anchor graph hashing, however, incurs high computation cost since (1) the computation cost of obtaining the eigenvectors is quadratic to the number of anchor points, and (2) the similarities of the query data point to all the anchor points must be computed. Our proposal, Tridiagonal hashing, increases the efficiency of anchor graph hashing because of its two advances: (1) we apply a graph clustering algorithm to compute the eigenvectors from the tridiagonal matrix obtained from the similarities between data points and anchor points, and (2) we detect anchor points closest to the query data point by using a dimensionality reduction approach. Experiments show that our approach is several orders of magnitude faster than previous approaches. Moreover, it yields higher search accuracy than the original anchor graph hashing approach.

11:00 – 12:00 CEST | Research Session 6: Memory Management | Chaired by Ippokratis Pandis

Breaking Down Memory Walls: Adaptive Memory Management in LSM-based Storage Systems [Download Paper] (Snowflake Inc.), (UC Irvine) Log-Structured Merge-trees (LSM-trees) have been widely used in modern NoSQL systems. Due to their out-of-place update design, LSM-trees have introduced memory walls among the memory components of multiple LSM-trees and between the write memory and the buffer cache. Optimal memory allocation among these regions is non-trivial because it is highly workload-dependent. Existing LSM-tree implementations instead adopt static memory allocation schemes due to their simplicity and robustness, sacrificing performance. In this paper, we attempt to break down these memory walls in LSM-based storage systems. We first present a memory management architecture that enables adaptive memory management. We then present a partitioned memory component structure with new flush policies to better exploit the write memory to minimize the write cost. To break down the memory wall between the write memory and the buffer cache, we further introduce a memory tuner that tunes the memory allocation between these two regions. We have conducted extensive experiments in the context of Apache AsterixDB using the YCSB and TPC-C benchmarks and we present the results here.

Achieving High Throughput and Elasticity in a Larger-than-Memory Store [Download Paper] (University of Utah), (Microsoft Research), (University of Utah) Millions of sensors, mobile applications and machines now generate billions of events. Specialized many-core key-value stores (KVSs) can ingest and index these events at high rates (over 100 Mops/s on one machine) if events are generated on the same machine; however, to be practical and cost-effective they must ingest events over the network and scale across cloud resources elastically.
We present Shadowfax, a new distributed KVS based on FASTER, that transparently spans DRAM, SSDs, and cloud blob storage while serving 130 Mops/s/VM over commodity Azure VMs using conventional Linux TCP. Beyond high single-VM performance, Shadowfax uses a unique approach to distributed reconfiguration that avoids any server-side key ownership checks or cross-core coordination both during normal operation and migration. Hence, Shadowfax can shift load in 17 s to improve system throughput by 10 Mops/s with little disruption. Compared to the state-of-the-art, it has 8× better throughput (than Seastar+memcached) and avoids costly I/O to move cold data during migration. On 12 machines, Shadowfax retains its high throughput to perform 930 Mops/s, which, to the best of our knowledge, is the highest reported throughput for a distributed KVS used for large-scale data ingestion and indexing.

Towards Cost-Effective and Elastic Cloud Database Deployment via Memory Disaggregation [Download Paper] (Alibaba Group), (USTC), (USTC), (Alibaba Group), (Alibaba), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Alibaba Group), (USTC), (USTC) It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of the memory disaggregation architecture, relying on high-speed RDMA networks, offers opportunities to build cost-effective and elastic cloud-native databases. There exist proposals to let unmodified applications run transparently on disaggregated systems. However, running a relational database kernel atop such proposals experiences notable performance degradation and time-consuming failure recovery, offsetting the benefits of disaggregation. To address these challenges, in this paper, we propose a novel database architecture called LegoBase, which explores the co-design of the database kernel and memory disaggregation. It pushes memory management back to the database layer, bypassing the Linux I/O stack and re-using or designing (remote) memory access optimizations with an understanding of data access patterns. LegoBase further splits the conventional ARIES fault tolerance protocol to independently handle local and remote memory failures for fast recovery of compute instances. We implemented LegoBase atop MySQL. We compare LegoBase against MySQL running on a standalone machine and the state-of-the-art disaggregation proposal Infiniswap. Our evaluation shows that even with a large fraction of data placed on remote memory, LegoBase’s system performance in terms of throughput (up to 9.41% drop) and P99 latency (up to 11.58% increase) is comparable to the monolithic MySQL setup, and significantly outperforms (by 1.99×-2.33×, respectively) the deployment of MySQL over Infiniswap.
Meanwhile, LegoBase introduces an up to 3.87× and 5.48× speedup of the recovery and warm-up time, respectively, over the monolithic MySQL and MySQL over Infiniswap, when handling failures or planned re-configurations.

11:00 – 12:00 CESTResearch Session 7: User Interfaces Chaired by Eleni Tzirita Zacharatou

Towards Plug-and-Play Visual Graph Query Interfaces: Data-driven Canned Pattern Selection for Large Networks [Download Paper] (NTU), (Nanyang Technological University), (Nanyang Technological University), (Fudan University), (POSTECH), (Hong Kong Baptist University) Canned patterns (i.e., small subgraph patterns) in visual graph query interfaces (a.k.a. GUIs) facilitate efficient query formulation by enabling a pattern-at-a-time construction mode. However, existing GUIs for querying large networks either do not expose any canned patterns or, if they do, select them manually based on domain knowledge. Unfortunately, manual generation of canned patterns is not only labor intensive but may also lack the diversity needed to support efficient visual formulation of a wide range of subgraph queries. In this paper, we present a novel generic and extensible framework called TATTOO that takes a data-driven approach to automatically selecting canned patterns for a GUI
from large networks. Specifically, it first decomposes the underlying network into truss-infested and truss-oblivious regions. Then candidate canned patterns capturing different real-world query topologies are generated from these regions. Canned patterns based on a user-specified plug are then selected for the GUI from these candidates by maximizing coverage and diversity, and by minimizing the cognitive load of the pattern set. Experimental studies with real-world datasets demonstrate the benefits of TATTOO. Importantly, this work takes a concrete step towards realizing plug-and-play visual graph query interfaces for large networks.

PATSQL: Efficient Synthesis of SQL Queries from Example Tables with Quick Inference of Projected Columns [Download Paper] (NTT DATA), (Nara Institute of Science and Technology), (NTT DATA), (NTT DATA) SQL is one of the most popular tools for data analysis, and it is now used by an increasing number of users without expertise in databases. Several studies have proposed programming-by-example approaches to help such non-experts write correct SQL queries. While existing methods support a variety of SQL features such as aggregation and nested queries, they suffer from a significant increase in computational cost as the scale of example tables increases. In this paper, we propose an efficient algorithm utilizing properties known in relational algebra to synthesize SQL queries from input and output tables. Our key insight is that a projection operator in a program sketch can be lifted above other operators by applying transformation rules in relational algebra, while preserving the semantics of the program. This enables a quick inference of appropriate columns in the projection operator, which is an essential component in synthesis but causes combinatorial explosions in prior work. We also introduce a novel form of constraints and its top-down propagation mechanism for efficient sketch completion. We implemented this algorithm in our tool PATSQL and evaluated it on 226 queries from prior benchmarks and Kaggle’s tutorials. As a result, PATSQL solved 67% of the benchmarks and found 90% of the solutions within a second. Our tool is available at https://naist-se.github.io/patsql/.

COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics [Download Paper] (Microsoft Research), (Microsoft), (Microsoft) Data analysis often involves comparing subsets of data across many dimensions to find unusual trends and patterns. While such comparisons can be expressed using SQL, the resulting queries tend to be complex to write and suffer from poor performance over large and high-dimensional datasets. In this paper, we propose a new logical operator COMPARE for relational databases that concisely captures the enumeration and comparison of subsets of data and greatly simplifies the expression of a large class of comparative queries. We extend the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a commercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that COMPARE results in a significant speedup over existing approaches, including physical plans generated by today’s database systems, user-defined functions (UDFs), as well as middleware solutions that compare subsets outside the databases.
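To make the class of queries concrete, here is a hypothetical groupwise comparison written the naive way in plain Python: every pair of groups is enumerated and scored (the table, columns, and L1 scoring are invented for illustration). COMPARE captures this enumerate-and-score pattern declaratively so the engine can optimize it, instead of executing one comparison at a time.

```python
import itertools

# Toy sales table: the kind of input a comparative query runs over.
rows = [
    ("east", 1, 100), ("east", 2, 120),
    ("west", 1, 90),  ("west", 2, 180),
    ("north", 1, 100), ("north", 2, 115),
]

# Group revenues by region, indexed by month.
series = {}
for region, month, revenue in rows:
    series.setdefault(region, {})[month] = revenue

# Naive groupwise comparison: enumerate every pair of regions and score
# how differently their monthly revenues evolve (L1 distance here).
scores = {
    (a, b): sum(abs(series[a][m] - series[b][m]) for m in series[a])
    for a, b in itertools.combinations(sorted(series), 2)
}
most_divergent = max(scores, key=scores.get)
print(most_divergent, scores[most_divergent])   # the most dissimilar pair
```

Expressed in SQL, each pairwise score becomes a self-join plus aggregation, which is exactly the repeated work the COMPARE operator lets the engine share.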

11:00 – 12:00 CESTResearch Session 8: Queries Chaired by Sven Groppe

Rumble: Data Independence for Large Messy Data Sets [Download Paper] (ETH Zürich), (ETH Zurich), (Beekeeper AG), (ETH Zurich), (ETHZ) This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark. While data sets of this type are increasingly widespread, most existing tools are built around a tabular data model, creating an impedance mismatch for both the engine and the query interface. In contrast, Rumble uses JSONiq, a standardized language specifically designed for querying JSON documents. The key challenge in the design and implementation of Rumble is mapping the recursive structure of JSON documents and JSONiq queries onto Spark's execution primitives based on tabular data frames. Our solution is to translate a JSONiq expression into a tree of iterators that dynamically switch between local and distributed execution modes depending on the nesting level. By overcoming the impedance mismatch in the engine, Rumble frees the user from solving the same problem for every single query, thus increasing their productivity considerably. As we show in extensive experiments, Rumble is able to scale to large and complex data sets in the terabyte range with a similar or better performance than other engines. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.

Scalable Querying of Nested Data [Download Paper] (Oxford University), (Oxford University), (University of Edinburgh), (University of Edinburgh) While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.
In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.
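As a minimal illustration of what shredding means (the encoding below is a common textbook scheme, not necessarily the paper's exact representation): a nested collection becomes flat relations linked by labels, so inner collections can be partitioned independently of their parents, which is what makes skew handling possible.

```python
# A nested collection: each order carries an inner collection of items.
orders = [
    {"id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
    {"id": 2, "items": [{"sku": "a", "qty": 5}]},
]

# Shredded encoding: the top-level relation keeps a label (here the order
# id) in place of each inner collection, and a child relation holds one
# flat row per inner element, keyed by that label.
top = [(o["id"],) for o in orders]
items = [(o["id"], it["sku"], it["qty"]) for o in orders for it in o["items"]]

print(top)    # [(1,), (2,)]
print(items)  # [(1, 'a', 2), (1, 'b', 1), (2, 'a', 5)]
```

Because `items` is flat, a distributed engine can spread a single order's very large inner collection across partitions, which naive flattening into one wide row cannot do.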

Procedural Extensions of SQL: Understanding their usage in the wild [Download Paper] (Microsoft Research India), (Microsoft Research India) Procedural extensions of SQL have been in existence for many decades now. However, little is known about their magnitude of usage and their complexity in real-world workloads. Procedural code executing in a RDBMS is known to have inefficiencies and limitations; as a result there have been several efforts to address this problem. However, the lack of understanding of their use in real workloads makes it challenging to (a) motivate new work in this area, (b) identify research challenges and opportunities, and (c) demonstrate impact of novel work. We aim to address these challenges with our work.
In this paper, we present the results of our in-depth analysis of thousands of stored procedures, user-defined functions and triggers taken from several real workloads. We introduce SQL-ProcBench, a benchmark for procedural workloads in RDBMSs. SQL-ProcBench has been created using the insights derived from our analysis, and thus represents real workloads. Using SQL-ProcBench, we present an experimental evaluation on several database engines to understand and identify research challenges and opportunities. We emphasize the need to work on these interesting and relevant problems, and encourage researchers to contribute to this area.

11:00 – 12:00 CESTResearch Session 9: Metadata Chaired by Wim Martens

The Smallest Extraction Problem [Download Paper] (Roma Tre University), (Univ.of Roma 3), (Roma Tre University), (Roma Tre University) We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large and templated websites and therefore at effectively tackling Web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in the literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extracts their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. The experiments on consolidated benchmarks show that the approach can substantially contribute to improving the state of the art.

Glean: Structured Extractions from Templatic Documents [Download Paper] (Google, USA), (Google), (Google), (Google), (Google), (Stanford University) Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones.
We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

Preference Queries over Taxonomic Domains [Download Paper] (Università di Bologna), (Politecnico di Milano), (Universita Roma Tre) When composing multiple preferences characterizing the most suitable results for a user, several issues may arise. Indeed, preferences can be partially contradictory, suffer from a mismatch with the level of detail of the actual data, and even lack natural properties such as transitivity. In this paper we formally investigate the problem of retrieving the best results complying with multiple preferences expressed in a logic-based language. Data are stored in relational tables with taxonomic domains, which allow the specification of preferences also over values that are more generic than those in the database. In this framework, we introduce two operators that rewrite preferences for enforcing the important properties of transitivity, which guarantees soundness of the result, and specificity, which solves all conflicts among preferences. Although, as we show, these two properties cannot be fully achieved together, we use our operators to identify the only two alternatives that ensure transitivity and minimize the residual conflicts. Building on this finding, we devise a technique, based on an original heuristics, for selecting the best results according to the two possible alternatives. We finally show, with a number of experiments over both synthetic and real-world datasets, the effectiveness and practical feasibility of the overall approach.

11:00 – 12:00 CESTResearch Session 10: Graph Traversal Chaired by Bogdan Arsintescu

ThunderRW: An In-Memory Graph Random Walk Engine [Download Paper] (National University of Singapore), (National University of Singapore), (National University of Singapore), (National University of Singapore), (Singapore Management University) As random walk (RW) is a powerful tool in many graph processing, mining, and learning applications, this paper proposes an efficient in-memory random walk engine named ThunderRW. In contrast to existing parallel systems, which focus on improving the performance of a single graph operation, ThunderRW supports massively parallel random walks. The core design of ThunderRW is motivated by our profiling results: RW algorithms have as much as 73.1% of CPU pipeline cycles stalled due to irregular memory accesses, significantly more memory stalls than conventional graph workloads consisting of a single graph operation such as BFS and SSSP. To improve memory efficiency, we first design a generic step-centric programming model named Gather-Move-Update to abstract different RW algorithms. Based on the programming model, we develop a step-interleaving technique that hides memory access latency by switching the execution of different random walk queries. In our experiments, we use four representative RW algorithms, including PPR, DeepWalk, Node2Vec and MetaPath, to demonstrate the efficiency and programming flexibility of ThunderRW. Experiment results show that ThunderRW outperforms state-of-the-art approaches by an order of magnitude, and the step-interleaving technique significantly reduces the CPU pipeline stalls from 73.1% to 15.0%.
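A toy rendering of a step-centric model in the spirit of Gather-Move-Update is sketched below; the function names and walk parameters are illustrative, not ThunderRW's API. A real engine would interleave many such queries per core so that the memory stall of one walk is hidden behind the work of others.

```python
import random

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 3], 3: [1]}

def gather(v):
    """Collect candidate transitions with (unnormalized) weights."""
    return [(u, 1.0) for u in graph[v]]       # unbiased walk: uniform weights

def move(candidates):
    """Sample the next vertex from the gathered candidates."""
    nodes, weights = zip(*candidates)
    return random.choices(nodes, weights=weights, k=1)[0]

def update(walk, v):
    """Record the step and decide whether the walk continues."""
    walk.append(v)
    return len(walk) < 5                      # fixed-length toy walks

def run_walk(start):
    walk = [start]
    while update(walk, move(gather(walk[-1]))):
        pass
    return walk

random.seed(7)
walk = run_walk(0)
print(walk)
```

Biased walks such as Node2Vec fit the same skeleton by changing only the weights that `gather` produces.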

Towards an Efficient Weighted Random Walk Domination [Download Paper] (Wuhan University), (RMIT University), (Huawei), (Wuhan University, China) In this paper, we propose and study a new problem called weighted random walk domination. Given a weighted graph $G(V, E)$ and a budget $B$ for the weighted random walk, it aims to find a set $S$ of $k$ nodes that minimizes the total cost for the remaining nodes to access $S$ through weighted random walks bounded by $B$. This problem is critical to a range of real-world applications, such as advertising in social networks and telecommunication base station selection in wireless sensor networks. We first present a dynamic programming based greedy method (DpSel) as a baseline. DpSel is time-consuming when $|V|$ is huge. Thus, to overcome this drawback, we propose a matrix-based greedy method (MatrixSel), which can reduce the computation cost greatly. To further accelerate MatrixSel, we propose a BoundSel approach to reduce the number of gain computations in each candidate selection by proactively estimating the upper bound of the marginal gain of each candidate node. Notably, all methods can achieve an approximation ratio of $(1 - 1/e)$. Experiments on real datasets have been conducted to verify the efficiency, effectiveness, memory consumption and scalability of our methods.
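The $(1 - 1/e)$ guarantee and the BoundSel idea both stem from greedy maximization of a monotone submodular gain. The sketch below shows the generic lazy-greedy pattern with upper-bound pruning; a toy set-coverage gain stands in for the paper's walk-cost reduction, and all names are invented.

```python
import heapq

# Each candidate node "covers" a set of nodes; coverage gain is monotone
# and submodular, so stale gains are valid upper bounds on fresh ones.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}, "e": {7}}

def lazy_greedy(k):
    covered, chosen = set(), []
    heap = [(-len(s), v) for v, s in covers.items()]   # stale upper bounds
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        _, v = heapq.heappop(heap)
        gain = len(covers[v] - covered)                # refresh the bound
        if not heap or gain >= -heap[0][0]:
            chosen.append(v)                           # still the best: take it
            covered |= covers[v]
        else:
            heapq.heappush(heap, (-gain, v))           # re-queue, skip the rest
    return chosen, covered

chosen, covered = lazy_greedy(3)
print(chosen, covered)
```

The pruning step is the analogue of BoundSel: a candidate whose upper bound cannot beat the best remaining bound never has its (expensive) exact gain recomputed.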

Maximizing Social Welfare in a Competitive Diffusion Model [Download Paper] (UBC), (The University of British Columbia), (Microsoft) Influence maximization (IM) has garnered a lot of attention in the literature owing to applications such as viral marketing and infection containment. It aims to select a small number of seed users to adopt an item such that adoption propagates to a large number of users in the network. Competitive IM focuses on the propagation of competing items in the network. Existing works on competitive IM have several limitations. (1) They fail to incorporate economic incentives in users' decision making in item adoptions. (2) The majority of the works aim to maximize the adoption of one particular item and ignore the collective role that different items play. (3) They focus mostly on one aspect of competition -- pure competition. To address these concerns, we study competitive IM under a utility-driven propagation model called UIC, and study social welfare maximization. The problem in general is not only NP-hard but also NP-hard to approximate within any constant factor. We therefore devise efficient instance-dependent approximation algorithms for the general case as well as a $(1-1/e-\epsilon)$-approximation algorithm for a restricted setting. Our algorithms outperform different baselines on competitive IM, both in terms of solution quality and running time, on large real networks under both synthetic and real utility configurations.

11:00 – 12:00 CESTResearch Session 11: Missing Values Chaired by Mourad Khayati

Missing Value Imputation on Multidimensional Time Series [Download Paper] (IIT Bombay), (IIT Bombay), (Indian Institute of Technology) We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, and reliable data analytics calls for careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist, spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that these methods often provide worse results on aggregate analytics compared to just excluding the missing data.
DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and signals from correlated series at the same time. Instead of resorting to linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust parameter training with labeled data created using synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings.
Experiments across ten real datasets, five different missing scenarios, comparing seven conventional and three deep learning methods show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases, compared to the best existing method. Although slower than simpler matrix factorization methods, we justify the increased time overheads by showing that DeepMVI provides significantly more accurate imputation that finally impacts quality of downstream analytics.

Adaptive Data Augmentation for Supervised Learning over Missing Data [Download Paper] (Renmin University of China), (Renmin University of China), (Renmin University of China), (Qatar Computing Research Institute, HBKU), (Tsinghua University), (Renmin University of China) Real-world data is dirty, which causes serious problems in (supervised) machine learning (ML). The widely used practice in such scenarios is to first repair the labeled source (a.k.a. train) data using rule-, statistical- or ML-based methods and then use the “repaired” source to train an ML model. During production, unlabeled target (a.k.a. test) data will also be repaired, and is then fed into the trained ML model for prediction. However, this process often causes a performance degradation when the source and target datasets are dirty with different noise patterns, which is common in practice.
In this paper, we propose an adaptive data augmentation approach, for handling missing data in supervised ML. The approach extracts noise patterns from target data, and adapts the source data with the extracted target noise patterns while still preserving supervision signals in the source. Then, it patches the ML model by retraining it on the adapted data, in order to better serve the target. To effectively support adaptive data augmentation, we propose a novel generative adversarial network (GAN) based framework, called DAGAN, which works in an unsupervised fashion. DAGAN consists of two connected GAN networks. The first GAN learns the noise pattern from the target, for target mask generation. The second GAN uses the learned target mask to augment the source data, for source data adaptation. The augmented source data is used to retrain the ML model. Extensive experiments show that our method significantly improves the ML model performance and is more robust than the state-of-the-art missing data imputation solutions for handling datasets with different missing value patterns.

Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions [Download Paper] (ETH Zürich), (GATECH), (Georgia Institute of Technology), (ETH Zürich), (GATECH), (Microsoft Research), (ETH) Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) --- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction.
We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions on the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed --- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds.
We demonstrate one example use case of CP in the important application of "data cleaning for machine learning" (DC for ML). We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% of the gap on average.
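For intuition, certain prediction can be checked by brute force on a tiny example (the dataset and candidate repairs below are invented; the paper's contribution is precisely avoiding this exponential enumeration for NN classifiers):

```python
from itertools import product

rows = [          # (feature, label); None marks a missing feature value
    (1.0, "pos"),
    (None, "pos"),
    (4.0, "neg"),
]
candidates = [0.0, 5.0]        # possible repairs for each missing cell

def predict_1nn(world, x):
    """Label of the nearest training point in one possible world."""
    return min(world, key=lambda r: abs(r[0] - x))[1]

def certainly_predicted(x):
    """Enumerate every possible world; CP holds iff all agree on x."""
    missing = [i for i, r in enumerate(rows) if r[0] is None]
    labels = set()
    for fill in product(candidates, repeat=len(missing)):
        world = list(rows)
        for i, v in zip(missing, fill):
            world[i] = (v, world[i][1])
        labels.add(predict_1nn(world, x))
    return len(labels) == 1, labels

print(certainly_predicted(0.5))   # every world agrees
print(certainly_predicted(4.5))   # the prediction depends on the repair
```

With $d$ missing cells and $c$ candidate repairs this loop visits $c^d$ worlds; the paper shows how NN structure collapses that to linear or polynomial time.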

13:00 – 15:00 CESTResearch Session 12: Cardinality Estimation Chaired by Zoi Kaoudi

Fauce: Fast and Accurate Deep Ensembles with Uncertainty for Cardinality Estimation [Download Paper] (University of California, Merced), (University of California, Merced), (University of California, Merced), (Tencent Inc.) Cardinality estimation is a fundamental and critical problem in databases. Recently, many estimators based on deep learning have been proposed to solve this problem and they have achieved promising results. However, these estimators struggle to provide accurate results for complex queries, due to not capturing real inter-column and inter-table correlations. Furthermore, none of these estimators contain uncertainty information about their estimations. In this paper, we present a join cardinality estimator called Fauce. Fauce learns the correlations across all columns and all tables in the database, and it also quantifies the uncertainty of each of its estimates. Among all studied learned estimators, our results are promising: (1) Fauce has the smallest model size; (2) it has the fastest inference speed; (3) compared with the state-of-the-art estimator, Fauce has 10 times faster inference speed, and provides 1.3 to 6.7 times smaller estimation errors for complex queries; (4) to the best of our knowledge, Fauce is the first estimator that incorporates cardinality estimation uncertainty information into a deep learning model.

FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation [Download Paper] (Alibaba Group), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Peking University) Query optimizers rely on accurate cardinality estimation (CardEst) to produce good execution plans. The core problem of CardEst is how to model the rich joint distribution of attributes in an accurate and compact manner. Despite decades of research, existing methods either over-simplify the models by using only independent factorization, which leads to inaccurate estimates, or over-complicate them with lossless conditional factorization without any independence assumptions, which results in slow probability computation. In this paper, we propose FLAT, a CardEst method that is simultaneously fast in probability computation, lightweight in model size and accurate in estimation quality. The key idea of FLAT is a novel unsupervised graphical model, called FSPN. It utilizes both independent and conditional factorization to adaptively model different levels of attribute correlations, and thus combines their advantages. FLAT supports efficient online probability computation in near linear time on the underlying FSPN model, provides effective offline model construction and enables incremental model updates. It can estimate cardinality for both single table queries and multi-table join queries. Extensive experimental study demonstrates the superiority of FLAT over existing CardEst methods: FLAT achieves 1–5 orders of magnitude better accuracy, 1–3 orders of magnitude faster probability computation speed and 1–2 orders of magnitude lower storage cost. We also integrate FLAT into Postgres to perform an end-to-end test. It improves the query execution time by 12.9% on the well-known IMDB benchmark workload, which is very close to the optimal result of 14.2% using the true cardinality.

Astrid: Accurate Selectivity Estimation for String Predicates using Deep Learning [Download Paper] (The University of Texas at Arlington), (QCRI), (University of Toronto), (U. of Texas Arlington) Accurate selectivity estimation for string predicates is a longstanding research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing such as embeddings to encode the strings and use it to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement.
We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than to 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring and suffix queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.

Flow-Loss: Learning Cardinality Estimates That Matter [Download Paper] (MIT CSAIL), (MIT), (MIT), (MIT CSAIL), (Intel Labs and MIT), (MIT), (Massachusetts Institute of Technology) Recently there has been significant interest in using machine learning to improve the accuracy of cardinality estimation. This work has focused on improving average estimation error, but not all estimates matter equally for downstream tasks like query optimization. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, for learning cardinality estimation models. Flow-Loss approximates the optimizer’s cost model and search algorithm with analytical functions, which it uses to optimize explicitly for better query plans. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain “plan graph”, in which different paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark (CEB) which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the plan costs and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models perform well. However, the Q-Error-trained model degrades significantly when evaluated on slightly different queries (e.g., similar but unseen query templates), while the Flow-Loss trained model generalizes better to such situations, achieving 4-8x better 99th percentile runtimes on unseen templates with the same model architecture and training data.
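For reference, the Q-Error baseline mentioned above is a purely accuracy-oriented loss: it measures the multiplicative factor by which an estimate is off, regardless of whether the error matters to the plan. A minimal sketch of its standard definition (the clamping to 1 is a common convention, not taken from this paper):

```python
def q_error(est, true):
    """Standard Q-Error: how many times over or under the estimate is.
    Symmetric in both directions; 1.0 means a perfect estimate."""
    est, true = max(est, 1.0), max(true, 1.0)  # clamp to avoid division by zero
    return max(est / true, true / est)

print(q_error(10, 1000))  # 100.0
print(q_error(1000, 10))  # 100.0 -- the same penalty in either direction
```

Flow-Loss differs precisely in that it would weight these errors by their effect on the optimizer's chosen plan rather than treating all hundred-fold errors alike.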

NeuroCard: One Cardinality Estimator for All Tables [Download Paper] (UC Berkeley), (UC Berkeley), (UC Berkeley), (UC Berkeley), (COVARIANT.AI), (COVARIANT.AI), (UC Berkeley) Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries because they make lossy modeling assumptions and fail to capture inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders of magnitude higher accuracy than the best prior methods (a new state-of-the-art result of 8.5× maximum error on JOB-light), scales to dozens of tables, while being compact in space (several MBs) and efficient to construct or update (seconds to minutes).
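The chain-rule factorization underlying such autoregressive density estimators can be illustrated with hand-written conditional tables: P(A=a, B=b) = P(A=a) · P(B=b | A=a), and the estimated cardinality is |T| times that probability. This is a toy sketch of the principle only; NeuroCard learns these conditionals with a deep autoregressive model over join samples, and the tables below are invented:

```python
# Hypothetical marginal and conditional distributions over two columns.
p_a = {"x": 0.5, "y": 0.5}
p_b_given_a = {"x": {"u": 0.9, "v": 0.1},
               "y": {"u": 0.2, "v": 0.8}}

def estimate_cardinality(table_size, a, b):
    """Cardinality of (A = a AND B = b) via the autoregressive chain rule,
    which captures the correlation between A and B exactly."""
    prob = p_a[a] * p_b_given_a[a][b]
    return table_size * prob

print(estimate_cardinality(1000, "x", "u"))  # 450.0
```

An independence-assuming estimator would instead use P(A=a) · P(B=b) with the marginal P(B=u) = 0.55, giving 275 rows here; conditioning is what removes that error.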

13:00 – 15:00 CEST Research Session 13: Spatial Paths and Maps Chaired by Constantinos Costa

Towards Crowd-aware Indoor Path Planning [Download Paper] (Aalborg University), (Aalborg University), (Roskilde University), (Monash University), (Zhejiang University) Indoor venues accommodate many people who collectively form crowds. Such crowds in turn influence people’s routing choices, e.g., people may prefer to avoid crowded rooms when walking from A to B. This paper studies two types of crowd-aware indoor path planning queries. The Indoor Crowd-Aware Fastest Path Query (FPQ) finds a path with the shortest travel time in the presence of crowds, whereas the Indoor Least Crowded Path Query (LCPQ) finds a path encountering the least objects en route. To process the queries, we design a unified framework with three major components. First, an indoor crowd model organizes indoor topology and captures object flows between rooms. Second, a time-evolving population estimator derives room populations for a future timestamp to support crowd-aware routing cost computations in query processing. Third, two exact and two approximate query processing algorithms process each type of query. All algorithms are based on graph traversal over the indoor crowd model and use the same search framework with different strategies of updating the populations during the search process. All proposals are evaluated experimentally on synthetic and real data. The experimental results demonstrate the efficiency and scalability of our framework and query processing algorithms.
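The crowd-aware fastest-path idea (FPQ) can be sketched as a Dijkstra variant in which each edge's travel time grows with the estimated population of the room it crosses. The graph, populations, and slowdown model below are made-up toy values, not the paper's model:

```python
import heapq

# Indoor topology: node -> [(neighbor, base_seconds, room crossed)]
edges = {
    "A": [("B", 10, "hall"), ("C", 10, "lobby")],
    "B": [("D", 10, "hall")],
    "C": [("D", 10, "lobby")],
    "D": [],
}
population = {"hall": 50, "lobby": 5}  # estimated occupants per room

def crowd_cost(base, room, slowdown_per_person=0.1):
    # Hypothetical cost model: travel time inflated by room crowdedness.
    return base * (1 + slowdown_per_person * population[room])

def fastest_path_time(src, dst):
    """Dijkstra over crowd-adjusted edge costs."""
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, base, room in edges[u]:
            nd = d + crowd_cost(base, room)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

print(fastest_path_time("A", "D"))  # 30.0: the quiet lobby route beats the crowded hall
```

The paper's framework additionally re-derives the populations for future timestamps during the search; this sketch treats them as fixed.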

AutoGR: Automated Geo-Replication with Fast System Performance and Preserved Application Semantics [Download Paper] (USTC), (USTC), (University of Science and Technology of China), (USTC), (University of Nevada, Reno), (Nanjing University), (University of Science and Technology of China) Geo-replication is essential for providing low-latency responses and quality Internet services. However, designing fast and correct geo-replicated services is challenging due to the complex trade-off between performance and consistency semantics in optimizing the expensive cross-site coordination. State-of-the-art solutions rely on programmers to derive sufficient application-specific invariants and code specifications, which is both time-consuming and error-prone. In this paper, we propose an end-to-end geo-replication deployment framework, AutoGR (AUTOmated Geo-Replication), to free programmers from such labor-intensive tasks. AutoGR enables geo-replication for non-distributed, serializable applications in an automated way with optimized performance and correct application semantics. Driven by a novel static analysis tool, Rigi, AutoGR can extract invariants for applications by verifying whether their geo-replicated versions obey the serializable semantics of the non-distributed application. Rigi takes application code as input and infers the side effects and path conditions that may lead to consistency violations. Rigi employs the Z3 theorem prover to identify pairs of conflicting side effects and feeds them to a geo-replication framework for automated cross-site deployment. We evaluate AutoGR by transforming four DB-compliant applications that are originally non-replicated to geo-replicated ones across 3 sites.
Compared with the state-of-the-art human-intervention-free automated approaches (e.g., for strong consistency), AutoGR reduces latency by up to 61.8% and achieves up to 2.12x higher peak throughput; compared with state-of-the-art approaches relying on manual analysis (e.g., PoR), AutoGR can quickly enable the geo-replication feature with zero human intervention while offering similarly low latency and high peak throughput.

PPQ-Trajectory: Spatio-temporal Quantization for Querying in Large Trajectory Repositories [Download Paper] (University of Warwick), (University of Warwick) We present PPQ-trajectory, a spatio-temporal quantization based solution for querying large dynamic trajectory data. PPQ-trajectory includes a partition-wise predictive quantizer (PPQ) that generates an error-bounded codebook with autocorrelation and spatial proximity-based partitions. The codebook is indexed to run approximate and exact spatio-temporal queries over compressed trajectories. PPQ-trajectory includes a coordinate quadtree coding for the codebook with support for exact queries. An incremental temporal partition-based index is utilised to avoid full reconstruction of trajectories during queries. An extensive set of experimental results for spatio-temporal queries on real trajectory datasets is presented. PPQ-trajectory shows significant improvements over the alternatives with respect to several performance measures, including the accuracy of results when the summary is used directly to provide approximate query results, the spatial deviation with which spatio-temporal path queries can be answered when the summary is used as an index, and the time taken to construct the summary. Superior results on the quality of the summary and the compression ratio are also demonstrated.

MDTP: A Multi-source Deep Traffic Prediction Framework over Spatio-Temporal Trajectory Data [Download Paper] (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University) Traffic prediction has drawn increasing attention for its ubiquitous real-life applications in traffic management, urban computing, public safety, and so on. Recently, the availability of massive trajectory data and the success of deep learning motivate a plethora of deep traffic prediction studies. However, the existing neural-network-based approaches tend to ignore the correlations between multiple types of moving objects located in the same spatio-temporal traffic area, which is suboptimal for traffic prediction analytics.
In this paper, we propose a multi-source deep traffic prediction framework over spatio-temporal trajectory data, termed MDTP. The framework includes two phases: spatio-temporal feature modeling and multi-source bridging. We present an enhanced graph convolutional network (GCN) model combined with a long short-term memory network (LSTM) to capture the spatial dependencies and temporal dynamics of traffic in the feature modeling phase. In the multi-source bridging phase, we propose two methods, Sum and Concat, to connect the learned features from different trajectory data sources. Extensive experiments on two real-life datasets show that MDTP i) achieves superior efficiency compared with classical time-series methods, machine learning methods, and state-of-the-art neural-network-based approaches; ii) offers a significant performance improvement over the single-source traffic prediction approach; and iii) performs traffic predictions in seconds even on tens of millions of trajectory records. Also, we develop MDTP+, a user-friendly interactive system to demonstrate traffic prediction analysis.

The Simpler The Better: An Indexing Approach for Shared-Route Planning Queries. [Download Paper] (Hong Kong University of Science and Technology), (Beihang University), (Beihang University), (Hong Kong University of Science and Technology) Ridesharing services have gained global popularity as a convenient, economic, and sustainable transportation mode in recent years. One fundamental challenge in these services is planning the shared routes (i.e., sequences of origins and destinations) among the passengers for the vehicles, such that the platform's total revenue is maximized. Though many methods can solve this problem, their effectiveness is still far from optimal in either empirical study (e.g., over 31% lower total revenue than our approach) or theoretical study (e.g., arbitrarily bad or impractical theoretical guarantees). In this paper, we study the shared-route planning queries in ridesharing services and focus on designing efficient algorithms with good approximation guarantees. Particularly, our idea is to iteratively search for the most profitable route among the unassigned requests for each vehicle, which is simpler than the existing methods. Unexpectedly, we prove this simple method has an approximation ratio of 0.5 to the optimal result. Moreover, we also design an index called additive tree to improve the efficiency and apply randomization to improve the approximation guarantee. Finally, experimental results on two real datasets demonstrate that our additive-tree-based approach outperforms the state-of-the-art algorithms by obtaining up to 31.4%-127.4% higher total revenue.

QARTA: An ML-based System for Accurate Map Services [Download Paper] (University of Minnesota), (Qatar Computing Research Institute), (Qatar Computing Research Institute), (University of Minnesota - Twin Cities) Map services (e.g., routing and store finding) are ubiquitous in widely used applications including navigation systems, ride sharing, and item/food delivery. Plenty of effort has been dedicated to supporting such services through designing more efficient algorithms, e.g., for shortest-path and range queries. We believe that efficiency is no longer the bottleneck for these map services; instead, it is the accuracy of the underlying road network and query results. This paper presents QARTA, an open-source, full-fledged system for highly accurate and scalable map services. QARTA employs machine learning techniques to construct its own highly accurate map, not only in terms of map topology but, more importantly, in terms of edge weights. QARTA also employs machine learning techniques to calibrate its query answers based on contextual information, including transportation modality, location, and time of day/week. QARTA is currently deployed in all taxis in the State of Qatar and in the third-largest food delivery company in the country, replacing the commercial map service previously in use and receiving hundreds of thousands of daily API calls with real-time response. Experimental evaluation of QARTA in this real deployment environment shows that QARTA has comparable or higher accuracy than commercial services.

Fast Augmentation Algorithms for Network Kernel Density Visualization [Download Paper] (Hong Kong Baptist University), (The Hong Kong Polytechnic University), (University of Macau), (Hong Kong Baptist University), (The University of Hong Kong, China) Network kernel density visualization, or NKDV, has been extensively used to visualize spatial data points in various domains, including traffic accident hotspot detection, crime hotspot detection, disease outbreak detection, and business and urban planning. Due to a wide range of applications for NKDV, some geographical software, e.g., ArcGIS, can also support this operation. However, computing NKDV is very time-consuming. Although NKDV has been used for more than a decade in different domains, existing algorithms are not scalable to million-sized datasets. To address this issue, we propose three efficient methods in this paper, namely aggregate distance augmentation (ADA), interval augmentation (IA), and hybrid augmentation (HA), which can significantly reduce the time complexity for computing NKDV. In our experiments, ADA, IA and HA can achieve at least 5x to 10x speedup, compared with the state-of-the-art solutions.

13:00 – 15:00 CEST Research Session 14: Graph Matching Chaired by Panagiotis Karras

Parallel Graph Algorithms in Constant Adaptive Rounds: Theory meets Practice [Download Paper] (University of Maryland), (MIT CSAIL), (Google Research), (Google), (Google), (Google) We study fundamental graph problems such as graph connectivity, minimum spanning forest (MSF), and approximate maximum (weight) matching in a distributed setting. In particular, we focus on the Adaptive Massively Parallel Computation (AMPC) model, which is a theoretical model that captures MapReduce-like computation augmented with a distributed hash table. We show the first AMPC algorithms for all of the studied problems that run in a constant number of rounds and use only O(n^ε) space per machine, where 0 < ε < 1. Our results improve both upon the previous results in the AMPC model, as well as the best-known results in the MPC model, which is the theoretical model underpinning many popular distributed computation frameworks, such as MapReduce, Hadoop, Beam, Pregel and Giraph. Finally, we provide an empirical comparison of the algorithms in the MPC and AMPC models in a fault-tolerant distributed computation environment. We empirically evaluate our algorithms on a set of large real-world graphs and show that our AMPC algorithms can achieve improvements in both running time and round-complexity over optimized MPC baselines.

Optimizing Bipartite Matching in Real-World Applications by Incremental Cost Computation [Download Paper] (GrabTaxi Holdings Pte Ltd), (GrabTaxi Holdings Pte Ltd), (National University of Singapore) The Kuhn-Munkres (KM) algorithm is a classical combinatorial optimization algorithm that is widely used for minimum cost bipartite matching in many real-world applications, such as transportation. For example, a ride-hailing service may use it to find the optimal assignment of drivers to passengers to minimize the overall wait time. Typically, given two bipartite sets, this process involves computing the edge costs between all bipartite pairs and finding an optimal matching. However, existing works overlook the impact of edge cost computation on the overall running time. In reality, edge cost computation often significantly outweighs the computation of the optimal assignment itself, as in the case of assigning drivers to passengers, which involves computing expensive graph shortest paths. Building on this observation, we find that common real-world settings exhibit a useful property that allows us to incrementally compute edge costs only as required, using an inexpensive lower-bound heuristic. This technique significantly reduces the overall cost of assignment compared to the original KM algorithm, as we demonstrate experimentally on multiple real-world data sets, workloads, and problems. Moreover, our algorithm is not limited to this domain and is potentially applicable in other settings where lower-bounding heuristics are available.
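The lazy-evaluation idea can be sketched with an admissible lower bound: straight-line distance never exceeds road distance, so the expensive exact cost is computed only when a pair's bound reaches the front of a priority queue. For brevity this sketch uses a greedy matching rather than full KM, and the coordinates and "road detour" factor are invented:

```python
import heapq, math

drivers = {"d1": (0, 0), "d2": (5, 5)}
riders = {"r1": (1, 0), "r2": (5, 6)}

def lower_bound(p, q):
    return math.dist(p, q)  # cheap: Euclidean distance

def exact_cost(p, q):
    # Stand-in for an expensive shortest-path query on the road network.
    return 1.3 * math.dist(p, q)  # pretend road detours add 30%

# Seed the queue with cheap lower bounds for every pair.
pq = [(lower_bound(dp, rp), False, d, r)
      for d, dp in drivers.items() for r, rp in riders.items()]
heapq.heapify(pq)

matched_d, matched_r, assignment = set(), set(), {}
while pq and len(assignment) < min(len(drivers), len(riders)):
    cost, is_exact, d, r = heapq.heappop(pq)
    if d in matched_d or r in matched_r:
        continue  # one side already assigned
    if not is_exact:
        # Refine this pair's cost only now that it may actually matter.
        heapq.heappush(pq, (exact_cost(drivers[d], riders[r]), True, d, r))
        continue
    assignment[d] = r
    matched_d.add(d)
    matched_r.add(r)

print(assignment)  # {'d1': 'r1', 'd2': 'r2'}; cross-pair exact costs never computed
```

The point mirrors the paper's: of the four candidate pairs, only the two promising ones ever trigger the expensive exact computation.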

RECEIPT: REfine CoarsE-grained IndePendent Tasks for Parallel Tip decomposition of Bipartite Graphs [Download Paper] (USC), (USC), (University of Southern California), (PUCRS) Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks, etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies ((2,2)-bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update (peeling) process: deleting vertices with the minimum number of butterflies and correspondingly updating the butterfly count of their 2-hop neighbors. The need to explore the 2-hop neighborhood renders tip decomposition computationally very expensive. Furthermore, the inherent sequentiality in peeling only minimum-butterfly vertices makes derived parallel algorithms prone to heavy synchronization. In this paper, we propose a novel parallel tip-decomposition algorithm – REfine CoarsE-grained Independent Tasks (RECEIPT) – that relaxes the sequential restrictions on peeling by partitioning the vertices into multiple independent subsets that can be concurrently peeled. This enables RECEIPT to simultaneously achieve a high degree of parallelism and a dramatic reduction in synchronizations. Further, RECEIPT employs a hybrid peeling strategy along with other optimizations that drastically reduce the amount of wedge exploration and execution time. We perform a detailed experimental evaluation of RECEIPT on a shared-memory multicore server. It can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms – achieving up to 1100x and 64x reduction in the number of thread synchronizations and traversed wedges, respectively. Using 36 threads, RECEIPT can provide up to 17.1x self-relative speedup.
Our implementation of RECEIPT is available at https://github.com/kartiklakhotia/RECEIPT.

Symmetric Continuous Subgraph Matching with Bidirectional Dynamic Programming [Download Paper] (Seoul National University), (Seoul National University), (Seoul National University), (Universita Roma Tor Vergata), (LUISS University), (POSTECH) In many real datasets such as social media streams and cyber data sources, graphs change over time through a graph update stream of edge insertions and deletions. Detecting critical patterns in such dynamic graphs plays an important role in various application domains such as fraud detection, cyber security, and recommendation systems for social networks. Given a dynamic data graph and a query graph, the continuous subgraph matching problem is to find all positive matches for each edge insertion and all negative matches for each edge deletion. The state-of-the-art algorithm TurboFlux uses a spanning tree of a query graph for filtering. However, using the spanning tree may have low pruning power because it does not take into account all edges of the query graph. In this paper, we present SymBi, a symmetric and much faster algorithm that maintains an auxiliary data structure based on a directed acyclic graph instead of a spanning tree, storing the intermediate results of bidirectional dynamic programming between the query graph and the dynamic graph. Extensive experiments with real and synthetic datasets show that SymBi outperforms the state-of-the-art algorithm by up to three orders of magnitude in terms of the elapsed time.

Massively Parallel Algorithms for Personalized PageRank [Download Paper] (The Chinese University of Hong Kong), (The Chinese University of Hong Kong), (The Chinese University of Hong Kong), (Renmin University of China) Personalized PageRank (PPR) has wide applications in search engines, social recommendations, community detection, and so on. Nowadays, graphs are becoming massive, and many IT companies need to deal with large graphs that cannot fit in the memory of most commodity servers. However, most existing state-of-the-art solutions for PPR computation only work on single machines and are inefficient in distributed frameworks, since such solutions either (i) result in an excessively large number of communication rounds, or (ii) incur high communication costs in each round.
Motivated by this, we present Delta-Push, an efficient framework for single-source and top-$k$ PPR queries in distributed settings. Our goal is to reduce the number of rounds while guaranteeing that the load, i.e., the maximum number of messages an executor sends or receives in a round, can be bounded by the capacity of each executor. We first present a non-trivial combination of a redesigned parallel push algorithm and the Monte-Carlo method to answer single-source PPR queries. The solution uses pre-sampled random walks to reduce the number of rounds for the push algorithm. Theoretical analysis under the Massively Parallel Computing (MPC) model shows that our proposed solution bounds the communication rounds to $O(\log{\frac{n^2\log{n}}{\epsilon^2m}})$ under a load of $O(m/p)$, where $m$ is the number of edges of the input graph, $p$ is the number of executors, and $\epsilon$ is a user-defined error parameter. In the meantime, as the number of executors increases to $p' = \gamma \cdot p$, the load constraint can be relaxed since each executor can hold $O(\gamma \cdot m/p')$ messages with invariant local memory. In such scenarios, multiple queries can be processed in batches simultaneously. We show that with a load of $O(\gamma \cdot m/p')$, our Delta-Push can process $\gamma$ queries in a batch with $O(\log{\frac{n^2\log{n}}{\gamma\epsilon^2m}})$ rounds, while other baseline solutions still keep the same round cost for each batch. We further present a new top-$k$ algorithm that is friendly to the distributed framework and reduces the number of rounds required in practice. Extensive experiments show that our proposed solution is more efficient than alternatives.
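The Monte-Carlo half of the approach can be sketched as alpha-decay random walks from the source: each walk stops with probability alpha at every step, and a node's PPR score is estimated by how often walks terminate there. This toy shows only that half on a made-up graph; Delta-Push combines such pre-sampled walks with a parallel push phase:

```python
import random
from collections import Counter

# Tiny directed graph: node -> out-neighbors (invented for illustration).
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}

def mc_ppr(source, alpha=0.15, walks=20000, seed=42):
    """Monte-Carlo single-source PPR: fraction of walks ending at each node."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        node = source
        while rng.random() > alpha:   # continue the walk with prob 1 - alpha
            nbrs = graph[node]
            if not nbrs:
                break                 # dead end: terminate here
            node = rng.choice(nbrs)
        hits[node] += 1               # the walk's termination node
    return {v: c / walks for v, c in hits.items()}

scores = mc_ppr(0)
print(max(scores, key=scores.get))  # 0: the source keeps the largest PPR mass
```

Since every walk terminates at exactly one node, the estimates form a probability distribution, which is what makes per-node error bounds in terms of the walk count possible.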

18Aug

09:00 – 10:30 CEST Research Session 16: NLP for Databases Chaired by Jacopo Urbani

The Case for NLP-Enhanced Database Tuning: Towards Tuning Tools that "Read the Manual" [Download Paper] (Cornell) A large body of knowledge on database tuning is available in the form of natural language text. We propose to leverage natural language processing (NLP) to make that knowledge accessible to automated tuning tools. We describe multiple avenues to exploit NLP for database tuning, and outline associated challenges and opportunities. As a proof of concept, we describe a simple prototype system that exploits recent NLP advances to mine tuning hints from Web documents. We show that mined tuning hints improve performance of MySQL and Postgres on TPC-H, compared to the default configuration.

Quality of Sentiment Analysis Tools: The Reasons of Inconsistency [Download Paper] (University of Paris), (University of Paris), (Université Paris Descartes), (Mohammed VI Polytechnic University), (University of Paris), (imba consulting) In this paper, we present a comprehensive study that evaluates six state-of-the-art sentiment analysis tools on five public datasets, based on the quality of predictive results in the presence of semantically equivalent documents, i.e., how consistent existing tools are in predicting the polarity of documents based on paraphrased text. We observe that sentiment analysis tools exhibit intra-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents by the same tool, and inter-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents across different tools. We introduce a heuristic to assess the data quality of an augmented dataset and a new set of metrics to evaluate tool inconsistencies. Our results indicate that tool inconsistency is still an open problem, and they point towards promising research directions and the accuracy improvements that can be obtained if such inconsistencies are resolved.

CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs [Download Paper] (Carleton University), (Carleton University), (Carleton University) Recently, there has been an increase in the number of knowledge graphs that can only be queried by experts. Describing questions using structured queries is not straightforward for non-expert users, who need sufficient knowledge about both the vocabulary and the structure of the queried knowledge graph, as well as the syntax of the structured query language used to describe their information needs. The most popular approach introduced to overcome these challenges is to use natural language to query knowledge graphs. Although several question answering benchmarks can be used to evaluate question-answering systems over a number of popular knowledge graphs, choosing a benchmark that accurately assesses the quality of a question answering system is a challenging task.
In this paper, we introduce CBench, an extensible and more informative benchmarking suite for analyzing benchmarks and evaluating question answering systems. CBench can be used to analyze existing benchmarks with respect to several fine-grained linguistic, syntactic, and structural properties of the questions and queries in the benchmark. We show that existing benchmarks vary significantly with respect to these properties, making any small subset of them unreliable for evaluating QA systems. Until further research improves the quality and comprehensiveness of benchmarks, CBench can be used to facilitate this evaluation using a set of popular benchmarks that can be augmented with other user-provided benchmarks. CBench not only evaluates a question answering system based on popular single-number metrics but also gives a detailed analysis of the linguistic, syntactic, and structural properties of answered and unanswered questions to help the developers of question answering systems understand where their system excels and where it struggles.

Improving Information Extraction from Visually Rich Documents using Visual Span Representations [Download Paper] (Ohio State University), (The Ohio State University) Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) tasks perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis – a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is two-fold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human-labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. It identifies the visual span(s) containing a named entity by leveraging this learned representation followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score.

09:00 – 10:30 CEST Research Session 17: Machine Learning for Databases Chaired by Tim Kraska

Explaining Inference Queries with Bayesian Optimization [Download Paper] (Simon Fraser University), (Simon Fraser University), (Simon Fraser University), (Simon Fraser University), (Columbia University) Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO - a technique for finding the global optimum of a black-box function - is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation compared to those found by the state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at https://github.com/sfu-db/BOExplain.
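The black-box objective behind predicate-based explanations can be illustrated by scoring a candidate predicate by how much removing its matching tuples shifts the aggregate of interest. BOExplain searches this space with Bayesian optimization; the sketch below instead brute-forces two candidates over a tiny hypothetical table:

```python
# Hypothetical input table with a suspicious region driving up SUM(value).
rows = [
    {"region": "north", "value": 10},
    {"region": "north", "value": 12},
    {"region": "south", "value": 90},
    {"region": "south", "value": 95},
]
total = sum(r["value"] for r in rows)

def objective(predicate):
    """Drop in SUM(value) caused by removing tuples matching the predicate."""
    kept = sum(r["value"] for r in rows if not predicate(r))
    return total - kept

candidates = {
    "region == 'north'": lambda r: r["region"] == "north",
    "region == 'south'": lambda r: r["region"] == "south",
}
best = max(candidates, key=lambda name: objective(candidates[name]))
print(best, objective(candidates[best]))  # region == 'south' 185
```

BO replaces the exhaustive `max` here with a surrogate-model-guided search, which is what makes the approach viable when the predicate space is large.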

An Inquiry into Machine Learning-based Automatic Configuration Tuning Services on Real-World Database Management Systems [Download Paper] (Carnegie Mellon University), (Princeton University), (Societe Generale), (Carnegie Mellon University), (OtterTune), (Societe Generale), (Carnegie Mellon University) Modern database management systems (DBMS) expose dozens of configurable knobs that control their runtime behavior. Setting these knobs correctly for an application's workload can improve the performance and efficiency of the DBMS. But because of their complexity, tuning a DBMS often requires considerable effort from experienced database administrators (DBAs). Recent work on automated tuning methods using machine learning (ML) has shown that they can achieve better performance than expert DBAs. These ML-based methods, however, were evaluated on synthetic workloads with limited tuning opportunities, and thus it is unknown whether they provide the same benefit in a production environment.
To better understand ML-based tuning, we conducted a thorough evaluation of ML-based DBMS knob tuning methods on an enterprise database application. We use the OtterTune tuning service to compare three state-of-the-art ML algorithms on an Oracle installation with a real workload trace. Our results with OtterTune show that these algorithms generate knob configurations that improve performance by up to 45% over enterprise-grade configurations. We also identify several deployment and measurement issues that we encountered in our study that were overlooked by previous research in automated DBMS tuning services.

Database Technology for the Masses: Sub-Operators as First-Class Entities [Download Paper] (TUM), (TU Munich) Relational databases have developed a wealth of technology over decades that has been successfully tried and tested in many settings and use cases. Yet, often most of it is left aside in the pursuit of performance (e.g., NoSQL) or new functionality (e.g., graph data, machine learning). In this paper, we argue that a wide range of techniques readily available in databases are crucial to tackling the challenges the IT industry faces in managing hardware trends, growing workloads, and the overall complexity of a quickly changing application and platform landscape.
However, to be truly useful, these techniques must be freed from the legacy component of database engines: relational operators. Therefore, we argue that in order to make databases a more flexible platform and extend their functionality to new data types and operations, engines must expose a lower level of abstraction: instead of working with SQL, database engines should compile, optimize, and run a collection of sub-operators for manipulating and managing data, offering them as an external interface. In this paper, we discuss the advantages of doing so, provide a first list of such sub-operators, and show how they can be used in practice.

Data Acquisition for Improving Machine Learning Models [Download Paper] (York University), (York University), (University of Toronto) The vast advances in Machine Learning over the last ten years have been powered by the availability of suitably prepared data for training purposes. The future of the ML-enabled enterprise hinges on data. As such, there is already a vibrant market offering data annotation services to tailor sophisticated ML models.
In this paper, inspired by the recent vision of online data markets and associated market designs, we present research on the practical problem of obtaining data in order to improve the accuracy of ML models. We consider an environment in which consumers query for data to enhance the accuracy of their models and data providers possess data and make it available for training purposes. We first formalize this interaction process, laying out a suitable framework and associated parameters for data exchange. We then propose data acquisition strategies that consider a trade-off between exploration, during which we obtain data to learn about the data distribution of a provider, and exploitation, during which we optimize our data inquiries utilizing the gained knowledge. We propose two algorithms for this purpose. The first, the estimation and allocation (EA) strategy, utilizes queries to estimate the utility of various predicates while learning about the data distribution of the provider; it then proceeds with an allocation step that uses these learned utilities to guide data acquisition decisions. The second, named Sequential Predicate Selection (SPS), utilizes a sampling strategy to explore the data distribution of the provider, adaptively investing more resources in parts of the data space that are statistically more promising to improve overall model accuracy.
We present a detailed experimental evaluation of our proposals utilizing a variety of ML models and associated real data sets exploring all applicable parameters of interest. Our results indicate the relative benefits of the proposed algorithms. Depending on the models we train and the associated learning tasks we identify trade-offs and highlight the relative benefits of each algorithm to further optimize model accuracy.

09:00 – 10:30 CEST Research Session 18: Query Optimization Chaired by Sven Groppe

Accelerating Approximate Aggregation Queries with Expensive Predicates [Download Paper] (Stanford University), (Stanford University), (Stanford University), (Stanford), (University of Chicago), (Stanford and Databricks) Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard techniques to accelerate these queries in the form of approximate query processing are not applicable because they assume the result of the predicates are available ahead of time, i.e., as structured records in a database. Furthermore, recent work using cheap approximations (i.e., proxies) do not support aggregation queries with predicates.
To accelerate aggregation queries with expensive predicates, we develop a novel stratified sampling algorithm that leverages proxies (ABae). ABae must account for the key challenge that we may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, we propose a two-stage sampling algorithm that estimates the optimal allocation for a fixed sampling budget and then samples according to this allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms all baselines on six real-world datasets, reducing labeling costs by up to 2.3x.
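A minimal sketch of the proxy-driven, two-stage idea follows. The strata boundaries, pilot size, budget, and the proportional allocation rule are all invented for illustration; the paper derives an optimal allocation, which this sketch does not implement.

```python
# Sketch of proxy-based stratified sampling: a pilot stage estimates how
# often each stratum satisfies the predicate, then the remaining budget is
# allocated toward strata likely to contain predicate-positive records.
import random
random.seed(0)

# Hypothetical records: (proxy_score, satisfies_predicate, value)
records = [(p := random.random(), p > 0.5, 10.0 if p > 0.5 else 0.0)
           for _ in range(3000)]

# Strata by proxy score, so positives concentrate in few strata.
strata = [[r for r in records if lo <= r[0] < lo + 0.25]
          for lo in (0.0, 0.25, 0.5, 0.75)]

# Stage 1: small pilot sample per stratum to estimate positive rates.
pilots = [random.sample(s, 30) for s in strata]
rates = [sum(r[1] for r in p) / len(p) for p in pilots]

# Stage 2: spend the remaining budget proportionally to estimated rates
# (a crude stand-in for the paper's optimal allocation).
budget = 400
total = sum(rates) or 1.0
draws = [random.sample(s, min(len(s), max(1, int(budget * w / total))))
         for s, w in zip(strata, rates)]
pos = [r[2] for d in draws for r in d if r[1]]
estimate = sum(pos) / len(pos)  # mean value over predicate-positive records
```

In this synthetic setup every positive record has value 10.0, so the estimate recovers the true mean exactly while spending almost no budget on the two all-negative strata.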

Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing [Download Paper] (U C IRVINE), (Alibaba Group), (Alibaba), (Alibaba), (Alibaba), (Alibaba), (Alibaba), (Alibaba), (Alibaba), (Alibaba Group), (Alibaba Group), (UC Irvine), (Alibaba Group) Incremental processing is widely adopted in many applications, ranging from incremental view maintenance, stream computing, to recently emerging progressive data warehouse and intermittent query processing. Despite many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper, we develop a novel cost-based optimizer framework, called Tempura, for optimizing incremental data processing. We propose an incremental query planning model called TIP based on the concept of time-varying relations, which can formally model incremental processing in its most general form. We give a full specification of Tempura, which can not only unify various existing techniques to generate an optimal incremental plan, but also allow developers to add their own rewrite rules. We study how to explore the plan space and search for an optimal incremental plan. We evaluate Tempura in various incremental processing scenarios to show its effectiveness and efficiency.
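The time-varying-relation idea that such planning builds on can be sketched in a few lines: a query result maintained incrementally from insert/delete deltas instead of being recomputed from scratch. The operator, keys, and delta encoding below are invented for illustration, not Tempura's TIP model.

```python
# Minimal incremental view maintenance: SUM and COUNT per key, updated by
# applying signed deltas (+1 insert, -1 delete/retraction).
class IncrementalSumCount:
    def __init__(self):
        self.state = {}  # key -> [sum, count]

    def apply(self, delta):
        for key, value, sign in delta:
            s = self.state.setdefault(key, [0, 0])
            s[0] += sign * value
            s[1] += sign

    def result(self):
        # keys whose rows were all retracted drop out of the view
        return {k: tuple(v) for k, v in self.state.items() if v[1] > 0}

view = IncrementalSumCount()
view.apply([("a", 5, +1), ("a", 7, +1), ("b", 1, +1)])
view.apply([("a", 5, -1)])   # a retraction arriving later
```

After both deltas, `view.result()` is `{"a": (7, 1), "b": (1, 1)}`; the choice of *when* and *how much* of such work to do incrementally is exactly what a cost-based incremental optimizer decides.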

Aggregated Deletion Propagation for Counting Conjunctive Query Answers [Download Paper] (Duke University), (Duke University), (Duke University), (Duke University), (Duke University, USA) We investigate the computational complexity of minimizing the source side-effect in order to remove a given number of tuples from the output of a conjunctive query. This is a variant of the well-studied deletion propagation problem, the difference being that we are interested in removing a smallest subset of input tuples to remove a given number of output tuples while deletion propagation focuses on removing a specific output tuple. We call this the Aggregated Deletion Propagation problem. We completely characterize the poly-time solvability of this problem for arbitrary conjunctive queries without self-joins. This includes a poly-time algorithm to decide solvability, as well as an exact structural characterization of NP-hard instances. We also provide a practical algorithm for this problem (a heuristic for NP-hard instances) and evaluate its experimental performance on real and synthetic datasets.
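The flavor of a practical heuristic for this problem can be illustrated with a greedy rule on a tiny two-table join (tables, query, and the greedy rule itself are invented here, not the paper's algorithm): repeatedly delete the input tuple whose removal kills the most remaining output tuples.

```python
# Greedy illustration of aggregated deletion propagation: remove input
# tuples until at least k output tuples of R join S are gone (assumes k
# is achievable, i.e., k <= |join(R, S)|).
R = [("x",), ("y",)]
S = [("x", 1), ("x", 2), ("x", 3), ("y", 4)]

def join(r, s):
    return [(a, b) for (a,) in r for (a2, b) in s if a == a2]

def greedy_adp(r, s, k):
    r, s, deleted = list(r), list(s), []
    target = len(join(r, s)) - k
    while len(join(r, s)) > target:
        # candidate deletions from either table, scored by outputs removed
        cands = [("R", t, [u for u in r if u != t], s) for t in r] + \
                [("S", t, r, [u for u in s if u != t]) for t in s]
        side, t, nr, ns = min(cands, key=lambda c: len(join(c[2], c[3])))
        r, s, deleted = nr, ns, deleted + [(side, t)]
    return deleted

deleted = greedy_adp(R, S, 3)
```

Deleting the single tuple `("x",)` from R removes three output tuples at once, so the greedy choice returns a one-tuple side-effect.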

Adaptive Code Generation for Data-Intensive Analytics [Download Paper] (Columbia University), (Columbia University), (Columbia University), (Oracle), (Oracle Labs) Modern database management systems employ sophisticated query optimization techniques that enable the generation of efficient plans for queries over very large data sets. A variety of other applications also process large data sets, but cannot leverage database-style query optimization for their code. We therefore identify an opportunity to enhance an open-source programming language compiler with database-style query optimization. Our system dynamically generates execution plans at query time, and runs those plans on chunks of data at a time. Based on feedback from earlier chunks, alternative plans might be used for later chunks. The compiler extension could be used for a variety of data-intensive applications, allowing all of them to benefit from this class of performance optimizations.
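The chunk-at-a-time feedback loop described above can be sketched as follows. This is not the paper's compiler extension; it is a hypothetical interpreter-level analogue in which observed filter selectivities from one chunk reorder the filters for later chunks.

```python
# Adaptive plan choice across chunks: run the current filter order on each
# chunk, count survivors per filter, and move the most selective filter
# (fewest survivors) to the front for the next chunk.
def make_chunks(n, size):
    data = list(range(n))
    return [data[i:i + size] for i in range(0, n, size)]

filters = [("even", lambda x: x % 2 == 0), ("small", lambda x: x < 10)]

out, order = [], list(filters)
for chunk in make_chunks(100, 25):
    passed = {name: 0 for name, _ in order}
    for x in chunk:
        for name, f in order:
            if not f(x):
                break
            passed[name] += 1
        else:
            out.append(x)
    # feedback: crude reordering by survivor counts (later filters see fewer
    # inputs, so this is a biased estimate -- fine for a sketch)
    order.sort(key=lambda nf: passed[nf[0]])
```

On the first chunk "small" lets through fewer tuples than "even", so it is promoted to run first on all later chunks, where it short-circuits almost every tuple.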

DISK: A Distributed Framework for Single-Source SimRank with Accuracy Guarantee [Download Paper] (Shenzhen Institute of Computing Sciences, Shenzhen University.), (University of Edinburgh), (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (Shenzhen University) Measuring similarities among different nodes is important in graph analysis. SimRank is one of the most popular similarity measures. Given a graph $G(V,E)$ and a source node $u$, a single-source SimRank query returns the similarities between $u$ and each node $v \in V$. This type of query is often used in link prediction, personalized recommendation and spam detection. Since dealing with a large graph is beyond the ability of a single machine due to its limited memory and computational power, it is necessary to process single-source SimRank queries in a distributed environment, where the graph is partitioned and distributed across multiple machines. However, most current solutions are based on a shared-memory model, where the whole graph is loaded into a shared memory and all processors can access the graph randomly. It is difficult to deploy such algorithms on a shared-nothing model. In this paper, we present DISK, a distributed framework for processing single-source SimRank queries. DISK follows the linearized formulation of SimRank, and consists of offline and online phases. In the offline phase, a tree-based method is used to estimate the diagonal correction matrix of SimRank accurately, and in the online phase, single-source similarities are computed iteratively. Under this framework, we propose different optimization techniques to boost the indexing and queries. DISK guarantees both accuracy and parallel scalability, which distinguishes itself from existing solutions. Its accuracy, efficiency, and parallel scalability are also verified by extensive experimental studies. The experiments show that DISK scales up to graphs of billions of nodes and edges, and answers online queries within seconds, while ensuring the accuracy bounds.
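For reference, the textbook SimRank recurrence that DISK's linearized, distributed formulation approximates can be computed naively on a tiny graph (decay factor and graph below are illustrative):

```python
# Naive iterative SimRank on a small directed graph: sim(u, u) = 1 and
# sim(u, v) = C * avg over in-neighbor pairs of sim, 0 if either has none.
C = 0.6
in_nbrs = {0: [], 1: [0], 2: [0], 3: [1, 2]}  # node -> in-neighbors
nodes = list(in_nbrs)

sim = {(u, v): float(u == v) for u in nodes for v in nodes}
for _ in range(20):  # fixed-point iteration
    new = {}
    for u in nodes:
        for v in nodes:
            if u == v:
                new[(u, v)] = 1.0
            elif in_nbrs[u] and in_nbrs[v]:
                total = sum(sim[(a, b)]
                            for a in in_nbrs[u] for b in in_nbrs[v])
                new[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
            else:
                new[(u, v)] = 0.0
    sim = new
```

Nodes 1 and 2 share their only in-neighbor, so `sim[(1, 2)]` converges to the decay factor 0.6; this all-pairs iteration is quadratic in the node count, which is exactly why single-source, distributed formulations are needed at scale.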

09:00 – 10:30 CEST Research Session 19: Data Streams I Chaired by Fabian Hueske

In the Land of Data Streams where Synopses are Missing, One Framework to Bring Them All [Download Paper] (Technische Universität Berlin), (TU Berlin), (TU Berlin), (TU Berlin), (Technische Universität Berlin) In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first-class citizens, i.e., as pipeline operators; implementing synopses is left to users. This is mainly because of the diversity of synopses, which makes a unified implementation difficult. We present Condor, a framework that supports synopses as first-class citizens. Condor facilitates the specification and processing of synopsis-based streaming jobs while hiding all internal processing details. Condor’s key component is its model that represents synopses as a particular case of windowed aggregate functions. An inherent divide-and-conquer strategy allows Condor to efficiently distribute the computation, allowing for high performance and linear scalability. Our evaluation shows that Condor outperforms existing approaches by up to 75x and that it scales linearly with the number of cores.

PR-Sketch: Monitoring Per-key Aggregation of Streaming Data with Nearly Full Accuracy [Download Paper] (Peking University), (Chinese Academy of Sciences) Computing per-key aggregations is indispensable in streaming data analysis and is typically formulated in two phases: an update phase and a recovery phase. As the size and speed of data streams rise, accurate per-key information is useful in many applications like anomaly detection, attack prevention, and online diagnosis. Even though many algorithms have been proposed for per-key aggregation in stream processing, their accuracy guarantees only cover a small portion of keys. In this paper, we aim to achieve nearly full accuracy with limited resource usage. We follow the line of sketch-based techniques. We observe that existing methods suffer from high errors for most keys. The reason is that they track keys with complicated mechanisms in the update phase and simply calculate per-key aggregations from specific counters in the recovery phase. Therefore, we present PR-Sketch, a novel sketching design to address these two limitations. PR-Sketch builds linear equations between counter values and per-key aggregations to improve accuracy, and records keys in the recovery phase to reduce resource usage in the update phase. We also provide an extension called fast PR-Sketch to improve the processing rate further. We derive space complexity, time complexity, and guaranteed error probability for both PR-Sketch and fast PR-Sketch. We conduct trace-driven experiments under 100K keys and 1M items to compare our algorithms with multiple state-of-the-art methods. Results demonstrate the resource efficiency and nearly full accuracy of our algorithms.
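The linear-equation idea can be shown at toy scale: each counter is a known linear combination of per-key aggregates, so once the key set is recorded, recovery is just solving a linear system. The hash functions, sketch sizes, and exact Gaussian elimination below are illustrative stand-ins for PR-Sketch's actual design.

```python
# Toy counter-based sketch with exact linear-system recovery.
from fractions import Fraction

keys = ["a", "b", "c"]
true = {"a": 5, "b": 3, "c": 7}
hashes = [lambda k: ord(k) % 2, lambda k: ord(k) % 3]  # one per counter row

# Update phase: each key's value is added into one counter per row.
counters = [[0] * 3 for _ in hashes]
for k, v in true.items():
    for row, h in enumerate(hashes):
        counters[row][h(k)] += v

# Recovery phase: one equation per counter; unknowns are per-key aggregates.
eqs = [([1 if h(k) == slot else 0 for k in keys], counters[row][slot])
       for row, h in enumerate(hashes) for slot in range(3)]

def solve(eqs, n):
    """Gaussian elimination over an (assumed consistent, full-rank) system."""
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in eqs]
    for col in range(n):
        piv = next(r for r in range(col, len(m)) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        for r in range(len(m)):
            if r != col and m[r][col] != 0:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

recovered = solve(eqs, len(keys))
```

Even though keys "a" and "c" collide in the first row, the combined system is full rank and recovery is exact, returning the aggregates 5, 3, 7.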

Ananke: A Streaming Framework for Live Forward Provenance [Download Paper] (Chalmers University of Technology), (Chalmers University of Technology and Volvo Car Corporation), (Chalmers University of Technology), (Chalmers University of Technology) Data streaming enables online monitoring of large and continuous event streams in Cyber-Physical Systems (CPSs). In such scenarios, fine-grained backward provenance tools can connect streaming query results to the source data producing them, allowing analysts to study the dependency/causality of CPS events. While CPS monitoring commonly produces many events, backward provenance does not help prioritize event inspection since it does not specify if an event's provenance could still contribute to future results.
To cover this gap, we introduce Ananke, a framework to extend any fine-grained backward provenance tool and deliver a live bipartite graph of fine-grained forward provenance. With Ananke, analysts can prioritize the analysis of provenance data based on whether such data is still potentially being processed by the monitoring queries. We prove our solution is correct, discuss multiple implementations, including one leveraging streaming APIs for parallel analysis, and show Ananke results in small overheads, close to those of existing tools for fine-grained backward provenance.

Optimization of Threshold Functions over Streams [Download Paper] (University of Washington), (Microsoft Research), (Microsoft Research), (Microsoft Research) A common stream processing application is alerting, where the data stream management system (DSMS) continuously evaluates a threshold function over incoming streams. If the threshold is crossed, the DSMS raises an alarm. The threshold function is often calculated over two or more streams, such as combining temperature and humidity readings to determine if moisture will form on a machine and therefore cause it to malfunction. This requires taking a temporal join across the input streams. We show that for the broad class of functions called quasiconvex functions, the DSMS needs to retain very few tuples per data stream for any given time interval and still never miss an alarm. This surprising result yields a large memory savings during normal operation. That savings is also important if one stream fails, since the DSMS would otherwise have to cache all tuples in other streams until the failed stream recovers. We prove our algorithm is optimal and provide experimental evidence that validates its substantial memory savings.
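The core memory-saving observation can be demonstrated on a toy case: a quasiconvex function over a box attains its maximum at a corner, so within a time interval it suffices to retain each stream's extreme readings and test only the corners. The threshold function and readings below are invented, and this two-extremes-per-stream rule is a simplification of the paper's algorithm.

```python
# Alarm check over a temporal-join interval: full pairing vs. corners only.
def alarm(temps, humids, threshold, f):
    """Full check: evaluate f on every (temperature, humidity) pairing."""
    return any(f(t, h) >= threshold for t in temps for h in humids)

def alarm_extremes(temps, humids, threshold, f):
    """Retain only min/max per stream; correct when f is quasiconvex."""
    corners = [(t, h) for t in (min(temps), max(temps))
               for h in (min(humids), max(humids))]
    return any(f(t, h) >= threshold for t, h in corners)

f = lambda t, h: t + 0.5 * h            # affine, hence quasiconvex
temps = [20.1, 22.4, 25.0, 21.7]
humids = [30.0, 55.0, 48.2]
```

Both checks agree on every threshold, but the second one stores two tuples per stream instead of all readings in the interval.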

TRACE: Real-time Compression of Streaming Trajectories in Road Networks [Download Paper] (Aalborg University), (Zhejiang University), (Aalborg University), (Aalborg University) The deployment of vehicle location services generates increasingly massive vehicle trajectory data, which incurs high storage and transmission costs. A range of studies target offline compression to reduce the storage cost. However, to enable online services such as real-time traffic monitoring, it is attractive to also reduce transmission costs by being able to compress streaming trajectories in real-time. Hence, we propose a framework called TRACE that enables compression, transmission, and querying of network-constrained streaming trajectories in a fully online fashion. We propose a compact two-stage representation of streaming trajectories: a speed-based representation removes redundant information, and a multiple-references based referential representation exploits subtrajectory similarities. In addition, the online referential representation is extended with reference selection, deletion and rewriting functions that further improve the compression performance. An efficient data transmission scheme is provided for achieving low transmission overhead. Finally, indexing and filtering techniques support efficient real-time range queries over compressed trajectories. Extensive experiments with real-life and synthetic datasets evaluate the different parts of TRACE, offering evidence that it is able to outperform the existing representative methods in terms of both compression ratio and transmission cost.
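A speed-based representation can be sketched in one dimension along a road: store the first sample plus a speed, and emit a new point only when the observed position deviates from the dead-reckoned one. The samples, tolerance, and single-speed scheme below are invented for illustration and much simpler than TRACE's representation.

```python
# Toy online speed-based compression of a 1-D network-constrained trajectory.
def compress(samples, tol=0.5):
    """samples: list of (time, position); returns (time, position, speed)."""
    (t0, p0), (t1, p1) = samples[0], samples[1]
    speed = (p1 - p0) / (t1 - t0)
    out = [(t0, p0, speed)]
    for t, p in samples[1:]:
        tb, pb, v = out[-1]
        predicted = pb + v * (t - tb)
        if abs(p - predicted) > tol:   # redundant points are dropped
            out.append((t, p, v))
    return out

samples = [(0, 0.0), (1, 10.0), (2, 20.2), (3, 30.1), (4, 55.0)]
compressed = compress(samples)
```

The first four samples are consistent with a speed of 10 within the tolerance, so only the initial point and the final deviating point survive: five samples compress to two.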

09:00 – 10:30 CEST Research Session 20: Graph Management I Chaired by Matteo Lissandrini

Collective Influence Maximization for Multiple Competing Products with an Awareness-to-Influence Model [Download Paper] (HKUST), (Amazon Search), (HKUST), (HKUST) Influence maximization (IM) is a fundamental task in social network analysis. Typically, IM aims at selecting a set of seeds for the network that influences the maximum number of individuals. Motivated by practical applications, in this paper we focus on an IM variant, where the owner of multiple competing products wishes to select seeds for each product so that the collective influence across all products is maximized. To capture the competing diffusion processes, we introduce an Awareness-to-Influence (AtI) model. In the first phase, awareness about each product propagates in the social graph unhindered by other competing products. In the second phase, a user adopts the most preferred product among those encountered in the awareness phase. To compute the seed sets, we propose GCW, a game-theoretic framework that views the various products as agents, which compete for influence in the social graph and selfishly select their individual strategy. We show that AtI exhibits monotonicity and submodularity; importantly, GCW is a monotone utility game. This allows us to develop an efficient best-response algorithm, with quality guarantees on the collective utility. Our experimental results suggest that our methods are effective, efficient, and scale well to large social networks.

Finding Group Steiner Trees in Graphs with both Vertex and Edge Weights [Download Paper] (National University of Singapore), (National University of Singapore), (Peking University), (University of Melbourne), (Stevens Institute of Technology), (Nanyang Technological University) Given an undirected graph and a number of vertex groups, the group Steiner tree problem is to find a tree such that (i) this tree contains at least one vertex in each vertex group; and (ii) the sum of vertex and edge weights in this tree is minimized. Solving this problem is useful in various scenarios, ranging from social networks to knowledge graphs. Most existing work focuses on solving this problem in vertex-unweighted graphs, and not enough work has been done to solve this problem in graphs with both vertex and edge weights. Here, we develop several algorithms to address this issue. Initially, we extend two algorithms from vertex-unweighted graphs to vertex- and edge-weighted graphs. The first one has no approximation guarantee, but often produces good solutions in practice. The second one has an approximation guarantee of |Γ| − 1, where |Γ| is the number of vertex groups. Since the extended (|Γ| − 1)-approximation algorithm is too slow when all vertex groups are large, we develop two new (|Γ| − 1)-approximation algorithms that overcome this weakness. Furthermore, by employing a dynamic programming approach, we develop another (|Γ|−ℎ+1)-approximation algorithm, where ℎ is a parameter between 2 and |Γ|. Experiments show that, while no algorithm is the best in all cases, our algorithms considerably outperform the state of the art in many scenarios.

Shortest Paths and Centrality in Uncertain Networks [Download Paper] (Nanyang Technological University), (CWI Amsterdam), (University of Vienna), (Nanyang Technological University), (ISI Foundation, Turin) Computing the shortest path between a pair of nodes is a fundamental graph primitive, which has critical applications in vehicle routing, finding functional pathways in biological networks, survivable network design, among many others. In this work, we study shortest-path queries over uncertain networks, i.e., graphs where every edge is associated with a probability of existence. We show that, for a given path, it is #P-hard to compute the probability of it being the shortest path, and we also derive other interesting properties highlighting the complexity of computing the Most Probable Shortest Paths (MPSPs). We thus devise sampling-based efficient algorithms, with end-to-end accuracy guarantees, to compute the MPSP. As a concrete application, we show how to compute a novel concept of betweenness centrality in an uncertain graph using MPSPs. Our thorough experimental results and rich real-world case studies on sensor networks and brain networks validate the effectiveness, efficiency, scalability, and usefulness of our solution.

Hierarchical Core Maintenance on Large Dynamic Graphs [Download Paper] (East China Normal University), (Guangzhou University), (University of New South Wales), (University of New South Wales), (Guangzhou University) The model of 𝑘-core and its decomposition have been applied in various areas, such as social networks, the world wide web, and biology. A graph can be decomposed into an elegant 𝑘-core hierarchy to facilitate cohesive subgraph discovery and network analysis. As many real-life graphs are fast evolving, existing work has proposed efficient algorithms to maintain the coreness value of every vertex against structure changes. However, the maintenance of the 𝑘-core hierarchy in existing studies is not complete, because the connections among different 𝑘-cores in the hierarchy are not considered. In this paper, we study hierarchical core maintenance, which is to compute the 𝑘-core hierarchy incrementally against graph dynamics. The problem is challenging because the change of hierarchy may be large and complex even for a slight graph update. In order to precisely locate the area affected by graph dynamics, we conduct in-depth analyses on the structural properties of the hierarchy, and propose well-designed local update techniques. Our algorithms significantly outperform the baselines on runtime by up to 3 orders of magnitude, as demonstrated on 10 real-world large graphs.
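For context, the static baseline that maintenance algorithms avoid rerunning is the classic peeling procedure, which computes every vertex's coreness from scratch (the graph below is illustrative):

```python
# Peeling-based k-core decomposition: repeatedly remove a minimum-degree
# vertex; the running maximum of removal degrees is each vertex's coreness.
def coreness(adj):
    deg = {v: len(ns) for v, ns in adj.items()}
    core, remaining, k = {}, set(adj), 0
    while remaining:
        v = min(remaining, key=lambda u: (deg[u], u))  # deterministic ties
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

# triangle {a, b, c} plus a pendant vertex d attached to a
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

Here the triangle vertices get coreness 2 and the pendant gets 1; on a dynamic graph, rerunning this whole procedure per update is what incremental hierarchy maintenance is designed to avoid.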

Local Algorithms for Distance-generalized Core Decomposition over Large Dynamic Graphs [Download Paper] (Hong Kong Baptist University), (Hong Kong Baptist University), (Hong Kong Baptist University), (Hong Kong Baptist University) The distance-generalized core, also called (k,h)-core, is defined as the maximal subgraph in which every vertex has at least $k$ vertices at distance no more than h. Compared with k-core, (k,h)-core can identify more fine-grained subgraphs and, hence, is more useful for applications such as network analysis and graph coloring. The state-of-the-art algorithms for (k,h)-core decomposition are peeling algorithms, which iteratively delete the vertex with the minimum $h$-degree (i.e., the least number of neighbors within h hops). However, they suffer from some limitations, such as low parallelism and incapability of supporting dynamic graphs. To address these limitations, in this paper, we revisit the problem of (k,h)-core decomposition. First, we introduce two novel concepts of pairwise h-attainability index and n-order H-index based on an insightful observation. Then, we thoroughly analyze the properties of n-order H-index and propose a parallelizable local algorithm for (k,h)-core decomposition. Moreover, several optimizations are presented to accelerate the local algorithm. Furthermore, we extend the proposed local algorithm to address the (k,h)-core maintenance problem for dynamic graphs. Experimental studies on real-world graphs show that, compared to the best existing solution, our proposed algorithms can reduce the (k,h)-core decomposition time by 1-3 orders of magnitude and save the maintenance cost by 1-2 orders of magnitude.
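The H-index style of local computation is easiest to see for the ordinary k-core (h = 1), where iterating the H-index of neighbor values is known to converge to the coreness; the paper generalizes this kind of n-order H-index to (k,h)-cores. The graph below is illustrative.

```python
# Local, parallelizable alternative to peeling for plain k-core: start each
# vertex at its degree and repeatedly replace its value with the H-index of
# its neighbors' values until a fixed point (the coreness) is reached.
def h_index(values):
    values = sorted(values, reverse=True)
    return max((i + 1 for i, v in enumerate(values) if v >= i + 1), default=0)

def coreness_by_h_index(adj):
    h = {v: len(ns) for v, ns in adj.items()}   # 0th order: degree
    while True:
        # each vertex reads only its neighbors, so one round parallelizes
        new = {v: h_index([h[u] for u in adj[v]]) for v in adj}
        if new == h:
            return h
        h = new

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

On this graph the iteration converges in two rounds to the same values peeling would produce (2 for the triangle, 1 for the pendant), using only neighborhood-local reads.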

11:00 – 12:00 CEST Research Session 21: User Interaction Chaired by Eleni Tzirita Zacharatou

Robust Voice Querying with MUVE: Optimally Visualizing Results of Phonetically Similar Queries [Download Paper] (Cornell University), (Cornell), (Cornell University) Recently proposed voice query interfaces translate voice input into SQL queries. Unreliable speech recognition on top of the intrinsic challenges of text-to-SQL translation makes it hard to reliably interpret user input. We present MUVE (Multiplots for Voice quEries), a system for robust voice querying. MUVE reduces the impact of ambiguous voice queries by filling the screen with multiplots, capturing results of phonetically similar queries. It maps voice input to a probability distribution over query candidates, executes a selected subset of queries, and visualizes their results in a multiplot. Our goal is to maximize the probability of showing the correct query result. Also, we want to optimize the visualization (e.g., by coloring a subset of likely results) in order to minimize the expected time until users find the correct result. Via a user study, we validate a simple cost model estimating the latter overhead. The resulting optimization problem is NP-hard. We propose an exhaustive algorithm, based on integer programming, as well as a greedy heuristic. As shown in a corresponding user study, MUVE enables users to identify accurate results faster, compared to prior work.

PolyFrame: A Retargetable Query-based Approach to Scaling Dataframes [Download Paper] (University of California, Irvine), (UC Irvine) In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a data analytics library, is implemented as a layer on top of Apache AsterixDB, addressing these issues by providing data scientists with the familiar Pandas DataFrame interface and transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a large volume of data, the initial version only generated SQL++ queries and only operated against AsterixDB. In this work, we describe a new design that retargets AFrame’s incremental query formation to other query-based database systems, making it more flexible for deployment against other data management systems with composable query languages.

11:00 – 12:00 CEST Research Session 22: Search Chaired by Constantinos Costa

LES3: Learning-based Exact Set Similarity Search [Download Paper] (York University), (York University), (University of Toronto) Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches to set similarity search utilize either heavy indexing structures, incurring large search costs, or indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach first partitions sets into groups, and then utilizes a light-weight bitmap-like indexing structure, called the token-group matrix (TGM), to organize groups and prune out candidates given a query set. In order to optimize pruning using the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning-based partitioning approach called L2P and an associated data representation encoding, PTR, to identify the partitions. We conduct extensive experiments on real and synthetic data sets to fully study LES3, establishing its effectiveness and superiority over other applicable approaches.
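The token-group-matrix idea can be illustrated with bitmaps: each token records which groups contain it, and a query prunes every group that shares no token with it. The grouping below is arbitrary, not the learned L2P partitioning, and real TGM pruning is finer than this overlap test.

```python
# Token -> group bitmap index used to prune groups before exact verification.
sets = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"x", "y"}, 3: {"y", "z"}}
groups = {0: [0, 1], 1: [2, 3]}          # group id -> member set ids

tgm = {}                                  # token -> bitmap over groups
for gid, members in groups.items():
    for sid in members:
        for tok in sets[sid]:
            tgm[tok] = tgm.get(tok, 0) | (1 << gid)

def overlap_candidates(query):
    """Groups sharing at least one token with the query survive pruning."""
    mask = 0
    for tok in query:
        mask |= tgm.get(tok, 0)
    return [gid for gid in groups if mask & (1 << gid)]

cands = overlap_candidates({"b", "q"})
```

For the query `{"b", "q"}` only group 0 survives, so the sets in group 1 are never touched during verification.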

DeltaPQ: Lossless Product Quantization Code Compression for High Dimensional Similarity Search [Download Paper] (Rutgers University) High dimensional data is ubiquitous and plays an important role in many applications. However, the size of high dimensional data is usually excessively large. To alleviate this problem, in this paper, we develop novel techniques to compress and search high dimensional data. Specifically, we first apply vector quantization, a classical lossy data compression method. It quantizes a high dimensional vector to a sequence of small integers, namely the quantization code. Then, we propose a novel lossless compression algorithm, DeltaPQ, to further compress the quantization codes. DeltaPQ organizes the quantization codes in a tree structure and stores the differences between two quantization codes rather than the original codes. Among the exponential number of possible tree structures, we develop an efficient algorithm, whose time and space complexity are linear in the number of codes, to find the one with the optimal compression ratio. The approximate nearest neighbor search query can be processed directly on the compressed data with a small space overhead of a few bytes. Many similarity measures can be supported, such as inner product, cosine similarity, Euclidean distance, and Lp-norm. Experimental results on five large-scale real-world datasets show that DeltaPQ achieves a compression ratio of up to 5 (and often greater than 2) on the quantization codes, whereas state-of-the-art general-purpose lossless compression algorithms barely work.
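The difference-storing idea can be sketched against a single reference code (DeltaPQ additionally optimizes a spanning tree over all codes, which this sketch omits; the codes below are invented):

```python
# Store each quantization code as positional differences from a reference
# code instead of in full; decoding reverses the substitution losslessly.
def delta_encode(codes, ref):
    return [[(i, c[i]) for i in range(len(ref)) if c[i] != ref[i]]
            for c in codes]

def delta_decode(deltas, ref):
    out = []
    for d in deltas:
        c = list(ref)
        for i, v in d:
            c[i] = v      # apply the recorded differences
        out.append(c)
    return out

ref = [3, 1, 4, 1, 5]
codes = [[3, 1, 4, 1, 5], [3, 2, 4, 1, 5], [3, 1, 4, 9, 5]]
deltas = delta_encode(codes, ref)
```

Similar codes yield very short difference lists (the first code stores nothing at all), and decoding recovers the originals exactly, which is the lossless property the paper requires.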

Budget Constrained Interactive Search for Multiple Targets [Download Paper] (Hong Kong Baptist University), (Hong Kong Baptist University), (Hong Kong Baptist University), (Hong Kong Baptist University), (Harbin Institute of Technology), (Hong Kong Baptist University) Interactive graph search leverages human intelligence to categorize target labels in a hierarchy, which is useful for image classification, product categorization, and database search. However, many existing studies of interactive graph search aim at identifying a single target optimally, and suffer from the limitations of asking too many questions and not being able to handle multiple targets. To address these two limitations, in this paper, we study a new problem of Budget constrained Interactive Graph Search for Multiple targets, called the kBM-IGS-problem. Specifically, given a set of multiple targets T in a hierarchy, and two parameters k and b, the goal is to identify a k-sized set of selections S such that the closeness between selections S and targets T is as small as possible, by asking at most a budget of b questions. We theoretically analyze the updating rules and design a penalty function to capture the closeness between selections and targets. To tackle the kBM-IGS-problem, we develop a novel framework to ask questions using the best vertex with the largest expected gain, which makes a balanced trade-off between target probability and benefit gain. Based on the kBM-IGS framework, we then propose an efficient algorithm STBIS to tackle a special case of the SingleTarget problem. Furthermore, we propose a dynamic programming based method kBM-DP to tackle kBM-IGS. To further improve efficiency, we propose two approximate methods kBM-Topk and kBM-DP+. Extensive experiments on large real-world datasets with ground-truth targets verify both the effectiveness and efficiency of our proposed algorithms.

11:00 – 12:00 CEST Research Session 23: Special Hardware Chaired by Zsolt Istvan

Efficient Join Algorithms For Large Database Tables in a Multi-GPU Environment [Download Paper] (University of South Florida), (University of South Florida), (University of South Florida) Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs, as a general-purpose parallel computing platform, are very promising for processing relational joins. However, join algorithms often need to handle very large input data, an issue that was not sufficiently addressed in existing work. Besides, as more and more desktop and workstation platforms support multi-GPU environments, the combined computing capability of multiple GPUs can easily reach that of a computing cluster. It is worth exploring how join processing would benefit from the adoption of multiple GPUs. We identify the low rate and complex patterns of data transfer between the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome these challenges, we propose three distinctive designs of multi-GPU join algorithms, namely the nested loop, global sort-merge, and hybrid joins, for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate high scalability of our algorithms over data size and a significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance than existing join algorithms, with speedups of up to 25X and 2.9X over the best known code developed for multi-core CPUs and GPUs, respectively.
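The general principle of splitting a large join across devices so that matching keys meet on the same device can be sketched in plain Python. The hash partitioning and the two simulated "devices" below are illustrative stand-ins; the paper's algorithms (nested loop, global sort-merge, hybrid) and the actual GPU transfer scheduling are considerably more involved.

```python
# Sketch of a partitioned join: hash-partition both tables so that equal
# keys land on the same device, then join each partition pair locally.
# "Devices" here are plain Python lists; real multi-GPU code would ship
# each partition to a GPU and overlap transfers with computation.

NUM_DEVICES = 2

def partition(table, key_col):
    parts = [[] for _ in range(NUM_DEVICES)]
    for row in table:
        parts[hash(row[key_col]) % NUM_DEVICES].append(row)
    return parts

def local_hash_join(r_part, s_part):
    index = {}
    for row in r_part:                       # build phase on one device
        index.setdefault(row[0], []).append(row)
    return [(r, s) for s in s_part for r in index.get(s[0], [])]

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
result = []
for r_part, s_part in zip(partition(R, 0), partition(S, 0)):
    result.extend(local_hash_join(r_part, s_part))
```

Because a key only ever hashes to one partition, the per-device joins are independent and can proceed in parallel without cross-device communication.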

Improving Execution Efficiency of Just-in-time Compilation based Query Processing on GPUs [Download Paper] (NUS), (National University of Singapore), (National University of Singapore), (Nanyang Technological University) In recent years, we have witnessed significant efforts to improve the performance of Online Analytical Processing (OLAP) on graphics processing units (GPUs). Most existing studies have focused on improving memory efficiency since memory stalls can play an essential role in query processing performance on GPUs. Motivated by the recent rise of just-in-time (JIT) compilation in query processing, we investigate whether and how we can further improve query processing performance on GPUs. Specifically, we study the execution of state-of-the-art JIT compile-based query processing systems. We find that thanks to advanced techniques such as database compression and JIT compilation, memory stalls are no longer the most significant bottleneck. Instead, current JIT compile-based query processing encounters severe under-utilization of GPU hardware due to divergent execution and degraded parallelism arising from resource contention. To address these issues, we propose a JIT compile-based query engine named Pyper to improve GPU utilization during query execution. Specifically, Pyper has two new operators, Shuffle and Segment, for query plan transformation, which can be plugged into a physical query plan in order to reduce divergent execution and resolve resource contention, respectively. To determine the insertion points for these two operators, we present an analytical model that helps insert Shuffle and Segment operators into a query plan in a cost-based manner. Our experiments show that 1) the analytical modeling of divergent execution and resource contention improves the accuracy of the cost model, and 2) the techniques proposed in this study help Pyper improve the performance of TPC-H and SSB queries on average by 1.60x and 1.52x, respectively.

Scotch: Generating FPGA-Accelerators for Sketching at Line Rate [Download Paper] (TU Berlin), (TU Berlin), (TU Berlin), (Technische Universität Berlin) Sketching algorithms are a powerful tool for single-pass data summarization. Their numerous applications include approximate query processing, machine learning, and large-scale network monitoring. In the presence of high-bandwidth interconnects or in-memory data, the throughput of summary maintenance over input data becomes the bottleneck. While FPGAs have shown admirable throughput and energy-efficiency for data processing tasks, developing FPGA accelerators requires a sophisticated hardware design and expensive manual tuning by an expert.
We propose Scotch, a novel system for accelerating sketch maintenance using FPGAs. Scotch provides a domain-specific language for the user-friendly, high-level definition of a broad class of sketching algorithms. A code generator performs the heavy-lifting of hardware description, while an auto-tuning algorithm optimizes the summary size. Our evaluation shows that FPGA accelerators generated by Scotch outperform CPU- and GPU-based sketching by up to two orders of magnitude in terms of throughput and up to a factor of five in terms of energy efficiency.
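As a concrete instance of the class of single-pass summaries Scotch targets, a Count-Min sketch maintains a small fixed-size table that supports approximate frequency queries. The Python version below only illustrates the update/query structure that such accelerators implement in hardware; the salted-hash construction is a simple stand-in, not Scotch's generated design.

```python
import random

class CountMinSketch:
    """Count-Min sketch: d hash rows of width w. A point query returns the
    row-wise minimum counter, which never underestimates the true count."""

    def __init__(self, width=256, depth=4, seed=42):
        rnd = random.Random(seed)
        self.width = width
        self.salts = [rnd.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def update(self, item, count=1):
        for row, salt in zip(self.rows, self.salts):
            row[hash((salt, item)) % self.width] += count

    def query(self, item):
        return min(row[hash((salt, item)) % self.width]
                   for row, salt in zip(self.rows, self.salts))

cm = CountMinSketch()
for token in ["a", "b", "a", "c", "a"]:
    cm.update(token)
assert cm.query("a") >= 3   # overestimates are possible, underestimates are not
```

Each update touches a constant number of counters regardless of stream length, which is what makes line-rate maintenance feasible on an FPGA.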

11:00 – 12:00 CEST Research Session 24: Efficient Query Processing Chaired by Peter Boncz

Decomposed Bounded Floats for Fast Compression and Queries [Download Paper] (University of Chicago), (University of Chicago), (University of Chicago), (University of Chicago) Modern data-intensive applications often generate large amounts of low precision float data with a limited range of values. Despite the prevalence of this data category, there is a lack of an effective solution to ingest, store, and analyze such bounded, low-precision, numeric data. To address this gap, we propose Buff, a new compression technique that uses a decomposed columnar storage and encoding methods to provide effective compression, fast ingestion, and high-speed in-situ adaptive query operators with SIMD support.
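A rough sketch of the decomposition idea, under assumed parameters (a known value range and 16-bit quantization; Buff's actual encoding differs): each quantized value is split into byte planes stored separately, and a range predicate can often be decided from the high plane alone, touching the low plane only on ties.

```python
# Illustrative decomposed fixed-point storage for bounded floats: quantize
# to 16 bits, store high and low bytes in separate planes, and answer a
# "value > threshold" filter using mostly the high plane. The layout and
# parameters here are assumptions, not Buff's actual format.

LO, HI = 0.0, 100.0   # assumed known value bounds

def encode(values):
    qs = [round((v - LO) * 65535 / (HI - LO)) for v in values]
    return [q >> 8 for q in qs], [q & 0xFF for q in qs]  # high, low planes

def filter_gt(high, low, threshold):
    qt = round((threshold - LO) * 65535 / (HI - LO))
    th, tl = qt >> 8, qt & 0xFF
    hits = []
    for i, h in enumerate(high):          # scan the narrow high plane first
        if h > th:
            hits.append(i)
        elif h == th and low[i] > tl:     # tie: consult the low plane
            hits.append(i)
    return hits

high, low = encode([12.5, 80.0, 33.3, 79.9])
assert filter_gt(high, low, 50.0) == [1, 3]
```

Because most rows are decided by the one-byte comparison, a vectorized (SIMD) scan over the high plane does the bulk of the work.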

Phoebe: A Learning-based Checkpoint Optimizer [Download Paper] (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Facebook), (Google), (Microsoft) Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery times after failures, and worse query optimizer estimates are examples of the issues we are facing at Microsoft.
To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans, and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment.
We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimum performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.
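A toy version of the checkpoint-selection step, with a greedy heuristic standing in for Phoebe's integer-programming formulation; the stage names, predicted runtimes, and output sizes below are invented for illustration.

```python
# Toy checkpoint selection: given per-stage predictions of runtime and
# output size (the quantities Phoebe's predictors learn), greedily keep
# the checkpoints with the best recompute-time-saved per GB stored, under
# a storage budget. The real system formulates this as an integer program.

def pick_checkpoints(stages, budget_gb):
    """stages: list of (name, runtime_sec, output_gb) in pipeline order."""
    cum_time, scored = 0.0, []
    for name, runtime, size in stages:
        cum_time += runtime
        # checkpointing here saves re-running everything up to this stage
        scored.append((cum_time / size, size, name))
    chosen, used = [], 0.0
    for _, size, name in sorted(scored, reverse=True):
        if used + size <= budget_gb:
            chosen.append(name)
            used += size
    return sorted(chosen)

stages = [("extract", 100, 50.0), ("join", 400, 5.0), ("aggregate", 200, 1.0)]
assert pick_checkpoints(stages, budget_gb=6.0) == ["aggregate", "join"]
```

The example reflects the paper's observation that a small number of well-placed checkpoints captures most of the recovery benefit: the cheap, late-stage outputs win over the bulky early one.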

FlexPushdownDB: Hybrid Pushdown and Caching in a Cloud DBMS [Download Paper] (University of California, San Diego), (Burnian), (MIT), (University of Wisconsin, Madison), (University of Wisconsin-Madison), (University of Massachusetts Amherst), (QCRI), (MIT) Modern cloud databases adopt a storage-disaggregation architecture that separates the management of computation and storage. A major bottleneck in such an architecture is the network connecting the computation and storage layers. Two solutions have been explored to mitigate the bottleneck: caching and computation pushdown. While both techniques can significantly reduce network traffic, existing DBMSs consider them as orthogonal techniques and support only one or the other, leaving potential performance benefits unexploited. In this paper we present FlexPushdownDB (FPDB), an OLAP cloud DBMS prototype that supports fine-grained hybrid query execution to combine the benefits of caching and computation pushdown in a storage-disaggregation architecture. We build a hybrid query executor based on a new concept called separable operators to combine the data from the cache and results from the pushdown processing. We also propose a novel Weighted-LFU cache replacement policy that takes into account the cost of pushdown computation. Our experimental evaluation on the Star Schema Benchmark shows that the hybrid execution outperforms both the conventional caching-only architecture and pushdown-only architecture by 2.2×. In the hybrid architecture, our experiments show that Weighted-LFU can outperform the baseline LFU by 37%.
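The Weighted-LFU idea, scaling a segment's access frequency by how costly it is to re-acquire via pushdown, can be sketched as below; the class name, API, and cost values are hypothetical, not FPDB's implementation.

```python
# Sketch of a cost-weighted LFU policy in the spirit of FPDB's Weighted-LFU:
# the eviction score is access frequency times the pushdown cost of
# re-fetching the segment, so segments that are cheap to push down are
# evicted first. Names and weights here are illustrative assumptions.

class WeightedLFU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = {}     # segment -> access count
        self.weight = {}   # segment -> pushdown cost per access

    def access(self, segment, pushdown_cost):
        if segment not in self.freq and len(self.freq) >= self.capacity:
            victim = min(self.freq, key=lambda s: self.freq[s] * self.weight[s])
            del self.freq[victim], self.weight[victim]
        self.freq[segment] = self.freq.get(segment, 0) + 1
        self.weight[segment] = pushdown_cost

cache = WeightedLFU(capacity=2)
cache.access("lineorder.lo_revenue", pushdown_cost=5.0)
cache.access("lineorder.lo_revenue", pushdown_cost=5.0)
cache.access("date.d_year", pushdown_cost=0.5)
cache.access("part.p_brand", pushdown_cost=2.0)   # evicts the cheap segment
assert "date.d_year" not in cache.freq
```

Plain LFU would have treated both candidates equally; weighting by pushdown cost keeps the segment whose cache miss would be expensive to serve.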

11:00 – 12:00 CEST Research Session 25: Data Synthesis and Labeling Chaired by Damianos Chatziantoniou

Auto-Pipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search [Download Paper] (The UNIVERSITY of CHICAGO), (Microsoft Research), (Microsoft) Recent work has made significant progress in helping users automate single data preparation steps, such as string transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). In this work, we propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines with both string transformations and table-manipulation operators.
We propose a novel by-target-schema paradigm that allows users to easily specify a desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target-schema, users provide input tables (e.g., csv or json files) and point us to a "target table" (e.g., an existing database table or BI dashboard) to demonstrate what the output from a desired pipeline on the given input should schematically "look like". While such a specification is easy for users to provide, it is fuzzy in nature, making the problem seemingly under-specified. Our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space of feasible solutions and make successful synthesis possible. We develop the Auto-Pipeline system, which leverages search and deep reinforcement-learning (DRL) based algorithms that learn to synthesize pipelines via "self-play" on pipelines authored by human experts (reminiscent of self-play learning in Atari and AlphaGo). Experiments on a large number of real pipelines (extracted from Jupyter Notebooks on GitHub and commercial vendors) suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines with up to 10 pipeline steps in 10-20 seconds.

Kamino: Constraint-Aware Differentially Private Data Synthesis [Download Paper] (University of Waterloo), (University of Waterloo), (University of Waterloo), (U. of Waterloo) Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but they fail to preserve one of the most fundamental properties of structured data -- the underlying correlations and dependencies among tuples and attributes (i.e., the structure of the data). This structure is often expressed as integrity and schema constraints, or with a probabilistic generative process. As a result, the synthesized data is not useful for any downstream tasks that require this structure to be preserved.
This work presents Kamino, a data synthesis system to ensure differential privacy and to preserve the structure and correlations present in the original dataset. Kamino takes as input a database instance, along with its schema (including integrity constraints), and produces a synthetic database instance with differential privacy and structure preservation guarantees. We empirically show that while preserving the structure of the data, Kamino achieves comparable and even better usefulness in applications of training classification models and answering marginal queries than the state-of-the-art methods of differentially private data synthesis.

Inspector Gadget: A Data Programming-based Labeling System for Industrial Images [Download Paper] (KAIST), (KAIST), (KAIST), (KAIST), (KAIST) As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed, where only a small portion is problematic (e.g., identifying defects on a surface). Since manually labeling these images is expensive, weak supervision is an attractive alternative, where the idea is to generate weak labels that are not perfect but can be produced at scale. Data programming is a recent paradigm in this category that uses human knowledge in the form of labeling functions and combines them into a generative model. Data programming has been successful in applications based on text or structured data, and can also be applied to images if one can find a way to convert them into structured data. In this work, we expand the horizon of data programming by directly applying it to images without this conversion, which is a common scenario for industrial applications. We propose Inspector Gadget, an image labeling system that combines crowdsourcing, data augmentation, and data programming to produce weak labels at scale for image classification. We perform experiments on real industrial image datasets and show that Inspector Gadget obtains better performance than other weak-labeling techniques: Snuba, GOGGLES, and self-learning baselines using convolutional neural networks (CNNs) without pre-training.
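The data-programming flavor can be illustrated with a plain majority vote over a few hand-written labeling functions; the functions and thresholds below are invented for illustration, and Inspector Gadget additionally learns function accuracies via a generative model rather than voting uniformly.

```python
# Minimal data-programming flavor: several noisy labeling functions vote on
# an image patch (here, a list of pixel intensities), and their unweighted
# majority becomes the weak label (1 = defect). All functions are invented
# stand-ins for domain knowledge about surface defects.

def lf_glare(patch):    return 1 if max(patch) > 200 else 0
def lf_contrast(patch): return 1 if max(patch) - min(patch) > 150 else 0
def lf_dark_spot(patch): return 1 if min(patch) < 10 else 0

LABELING_FUNCTIONS = [lf_glare, lf_contrast, lf_dark_spot]

def weak_label(patch):
    votes = sum(lf(patch) for lf in LABELING_FUNCTIONS)
    return 1 if votes * 2 > len(LABELING_FUNCTIONS) else 0

assert weak_label([5, 120, 240]) == 1    # dark spot + glare + high contrast
assert weak_label([100, 110, 120]) == 0  # uniform mid-gray patch
```

Each function alone is noisy, but their agreement yields labels that are cheap to produce at scale, which is the premise the system builds on.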

11:00 – 12:00 CEST Research Session 26: Deep Learning I Chaired by Hazar Harmouch

LANCET: Labeling Complex Data at Scale [Download Paper] (WPI), (MIT), (MIT), (Worcester Polytechnic Institute) Cutting-edge machine learning techniques such as deep learning are label thirsty, often requiring millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following research questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. These three questions are not only each challenging in their own right, but they also correspond to tightly interdependent problems. Yet existing techniques from weak supervision to active learning provide at best isolated solutions to a subset of these challenges. In this work, we propose the first approach, called LANCET, that successfully addresses all three challenges in an integrated framework. LANCET is based on a solid theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model, namely the Covariate-shift and the Continuity conditions. First, guided by the Covariate-shift condition, LANCET maps raw input data into a semantic feature space, where an unlabeled object is expected to share the same label with its near-by labeled neighbor. Next, guided by the Continuity condition, LANCET selects objects for labeling, aiming to ensure that unlabeled objects always have some sufficiently close labeled neighbors. These two strategies jointly maximize the accuracy of the automatically produced labels and the prediction accuracy of the machine learning models trained on these labels. 
Lastly, LANCET uses a distribution matching network to verify whether both the Covariate-shift and Continuity conditions hold, in which case it would be safe to terminate the labeling process. Our experiments on diverse public data sets demonstrate that LANCET outperforms the state-of-the-art methods from Snuba to GOGGLES and other baselines by a large margin -- up to 30 percentage points increase in accuracy.

Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics [Download Paper] (Stanford University), (Stanford University), (Stanford University), (Stanford University), (Stanford and Databricks) While deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in visual analytics systems on modern accelerators. Through a novel measurement study, we show that the preprocessing of data (e.g., decoding, resizing) can be the bottleneck in many visual analytics systems on modern hardware.
To address the bottleneck of preprocessing, we introduce two optimizations for end-to-end visual analytics systems. First, we introduce novel methods of achieving accuracy and throughput trade-offs by using natively present, low-resolution visual data. Second, we develop a runtime engine for efficient visual DNN inference. This runtime engine a) efficiently pipelines preprocessing and DNN execution for inference, b) places preprocessing operations on the CPU or GPU in a hardware- and input-aware manner, and c) efficiently manages memory and threading for high throughput execution. We implement these optimizations in a novel system, Smol, and evaluate Smol on eight visual datasets. We show that its optimizations can achieve up to 5.9x end-to-end throughput improvements at a fixed accuracy over recent work in visual analytics.
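The first runtime-engine optimization, pipelining preprocessing with inference, can be sketched with a producer thread feeding a bounded queue; both stages below are trivial stand-ins for image decoding and DNN execution, and the hardware-aware placement of stages is not modeled.

```python
import threading
import queue

# Sketch of overlapping preprocessing with inference: one thread decodes
# and resizes inputs while the main thread runs the (stand-in) model, so
# the two stages run concurrently instead of alternating.

def preprocess(raw):          # stand-in for decode + resize
    return [x / 255.0 for x in raw]

def infer(batch):             # stand-in for DNN inference
    return sum(batch)

def run_pipeline(inputs):
    q = queue.Queue(maxsize=4)   # bounded buffer between the stages

    def producer():
        for raw in inputs:
            q.put(preprocess(raw))
        q.put(None)              # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(infer(batch))
    return results

outputs = run_pipeline([[255, 255], [0, 255]])
assert outputs == [2.0, 1.0]
```

The bounded queue applies back-pressure: if inference is the slower stage, preprocessing pauses rather than buffering unboundedly, which is one memory-management concern such a runtime must handle.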

EdgeDIPN: a Unified Deep Intent Prediction Network Deployed at the Edge [Download Paper] (Alibaba inc.), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Alibaba), (Peking University) With the rapid growth of e-commerce in recent years, e-commerce platforms are becoming a primary place for people to find, compare and ultimately purchase products. To improve the online shopping experience for consumers and increase sales for sellers, it is important to understand user intent accurately and be notified of its change in a timely manner. In this way, the right information can be offered to the right person at the right time. To achieve this goal, we propose a unified deep intent prediction network, named EdgeDIPN, which is deployed at the edge, i.e., on mobile devices, and able to monitor multiple user intents at different granularities simultaneously in real-time. We propose to train EdgeDIPN with multi-task learning, by which EdgeDIPN can share representations between different tasks for better performance while saving edge resources. In particular, we propose a novel task-specific attention mechanism which enables different tasks to pick out the most relevant features from different data sources. To extract the shared representations more effectively, we utilize two kinds of attention mechanisms, where the multi-level attention mechanism tries to identify the important actions within each data source and the inter-view attention mechanism learns the interactions between different data sources. In the experiments conducted on a large-scale industrial dataset, EdgeDIPN significantly outperforms the baseline solutions. Moreover, EdgeDIPN has been deployed in the operational system of Alibaba. Online A/B testing results in several business scenarios reveal the potential of monitoring user intent in real-time.
To the best of our knowledge, EdgeDIPN is the first full-fledged real-time user intent understanding center deployed at the edge and serving hundreds of millions of users in a large-scale e-commerce platform.

11:00 – 12:00 CEST Research Session 27: Graph Processing I Chaired by Wim Martens

EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal in GPUs [Download Paper] (University of Illinois at Urbana-Champaign), (University of Illinois at Urbana-Champaign), (University of Illinois at Urbana-Champaign), (IBM Thomas J. Watson Research Center), (NVIDIA), (University of Illinois at Urbana-Champaign) Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between entities being analyzed. Practical graphs come in huge sizes, offer massive parallelism, and are stored in sparse-matrix formats such as compressed sparse row (CSR). To exploit the massive parallelism, developers are increasingly interested in using GPUs for graph traversal. However, due to their sizes, graphs often do not fit into the GPU memory. Prior works have either used input data preprocessing/partitioning or unified virtual memory (UVM) to migrate chunks of data from the host memory to the GPU memory. However, the large, multi-dimensional, and sparse nature of graph data presents a major challenge to these schemes and results in significant amplification of data movement and reduced effective data throughput. In this work, we propose EMOGI, an alternative approach to traverse graphs that do not fit in GPU memory using direct cache-line-sized access to data stored in host memory.
This paper addresses the open question of whether a sufficiently large number of overlapping cache-line-sized accesses can be sustained to 1) tolerate the long latency to host memory, 2) fully utilize the available bandwidth, and 3) achieve favorable execution performance. We analyze the data access patterns of several graph traversal applications in GPU over PCIe using an FPGA to understand the cause of poor external bandwidth utilization. By carefully coalescing and aligning external memory requests, we show that we can minimize the number of PCIe transactions and nearly fully utilize the PCIe bandwidth with direct cache-line accesses to the host memory. EMOGI achieves 2.60X speedup on average compared to the optimized UVM implementations in various graph traversal applications. We also show that EMOGI scales better than a UVM-based solution when the system uses higher bandwidth interconnects such as PCIe 4.0.

Efficiently Answering Reachability and Path Queries on Temporal Bipartite Graphs [Download Paper] (University of New South Wales), (University of New South Wales), (University of New South Wales), (University of New South Wales), (UTS), (University of Technology Sydney) Bipartite graphs are naturally used to model relationships between two different types of entities, such as people-location, author-paper, and customer-product. When modeling real-world applications like disease outbreaks, edges are often enriched with temporal information, leading to temporal bipartite graphs. While reachability has been extensively studied on (temporal) unipartite graphs, it remains largely unexplored on temporal bipartite graphs. To fill this research gap, in this paper, we study the reachability problem on temporal bipartite graphs. Specifically, a vertex u reaches a vertex w in a temporal bipartite graph G if u and w are connected through a series of consecutive wedges with time constraints. Towards efficiently answering whether one vertex can reach another, we propose an index-based method by adapting the idea of 2-hop labeling. Effective optimization strategies and parallelization techniques are devised to accelerate the index construction process. To better support real-life scenarios, we further show how the index is leveraged to efficiently answer other types of queries, e.g., single-source reachability query and earliest-arrival path query. Extensive experiments on 16 real-world graphs demonstrate the effectiveness and efficiency of our proposed techniques.
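The 2-hop labeling idea the paper adapts can be sketched as follows: each vertex stores an out-label and an in-label of "hop" vertices, and a reachability query reduces to a label intersection. The temporal wedge constraints central to the paper are omitted in this simplified sketch, and the labels below are hand-built rather than produced by an index-construction algorithm.

```python
# Simplified 2-hop labeling: vertex u reaches vertex w iff some hop vertex
# appears in both u's out-label and w's in-label (each vertex is also
# implicitly a hop for itself). Labels here are illustrative hand-built
# examples, not the output of a labeling algorithm.

out_label = {"u": {"h1", "h2"}, "v": {"h2"}, "w": set()}
in_label  = {"u": set(), "v": {"h1"}, "w": {"h2"}}

def reaches(src, dst):
    return bool((out_label[src] | {src}) & (in_label[dst] | {dst}))

assert reaches("u", "w")        # via hop h2
assert not reaches("w", "v")    # no common hop
```

The appeal of the scheme is that queries cost only a set intersection over small labels, at the price of an offline index-construction phase, which is where the paper's optimization and parallelization effort goes.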

Efficient Bi-triangle Counting for Large Bipartite Networks [Download Paper] (University of New South Wales), (School of Data Science, The Chinese University of Hong Kong, Shenzhen), (Polish-Japanese Institute of Information Technology), (University of New South Wales), (University of New South Wales) A bipartite network is a network with two disjoint vertex sets and its edges only exist between vertices from different sets. It has received much interest since it can be used to model the relationship between two different sets of objects in many applications (e.g., the relationship between users and items in E-commerce). In this paper, we study the problem of efficient bi-triangle counting for a large bipartite network, where a bi-triangle is a cycle with three vertices from one vertex set and three vertices from another vertex set. Counting bi-triangles has found many real applications such as computing the transitivity coefficient and clustering coefficient for bipartite networks. To enable efficient bi-triangle counting, we first develop a baseline algorithm relying on the observation that each bi-triangle can be considered as the join of three wedges. Then, we propose a more sophisticated algorithm which regards a bi-triangle as the join of two super-wedges, where a wedge is a path with two edges while a super-wedge is a path with three edges. We further optimize the algorithm by ranking vertices according to their degrees. We have performed extensive experiments on both real and synthetic bipartite networks, where the largest one contains more than one billion edges, and the results show that the proposed solutions are up to five orders of magnitude faster than the baseline method.
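The baseline observation that a bi-triangle is a join of three wedges admits a compact counting formula: for each triple of upper vertices, multiply the pairwise common-neighbor (wedge) counts and apply inclusion-exclusion to force the three lower endpoints to be distinct. The sketch below mirrors this baseline, not the optimized super-wedge algorithm with degree-based vertex ranking.

```python
from itertools import combinations

# Wedge-based bi-triangle counting. adj maps each upper vertex to its set
# of lower neighbors. For upper vertices u1, u2, u3 with pairwise common-
# neighbor counts c12, c23, c31 and triple-common count t, the number of
# bi-triangles on the triple is c12*c23*c31 minus the triples where two or
# three lower endpoints coincide (inclusion-exclusion).

def count_bitriangles(adj):
    total = 0
    for u1, u2, u3 in combinations(adj, 3):
        c12 = len(adj[u1] & adj[u2])
        c23 = len(adj[u2] & adj[u3])
        c31 = len(adj[u3] & adj[u1])
        t = len(adj[u1] & adj[u2] & adj[u3])
        total += c12 * c23 * c31 - t * (c12 + c23 + c31) + 2 * t
    return total

# K_{3,3}: every upper vertex adjacent to every lower vertex -> 6 bi-triangles
k33 = {u: {"x", "y", "z"} for u in "abc"}
assert count_bitriangles(k33) == 6
```

Enumerating all upper-vertex triples is what makes this baseline expensive on large networks; the paper's super-wedge join avoids materializing per-triple work.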

13:30 – 14:30 CEST Research Session 28: Data Streams II Chaired by Tyler Akidau

Real-Time Distance-Based Outlier Detection in Data Streams [Download Paper] (University of Southern California), (University of Southern California), (Computer Science Department, University of Southern California) Real-time outlier detection in data streams has drawn much attention recently as many applications need to be able to detect abnormal behaviors as soon as they occur. The arrival and departure of streaming data on edge devices impose new challenges to process the data quickly in real-time due to memory and CPU limitations of these devices. Existing methods are slow and not memory efficient as they mostly focus on quick detection of inliers and pay less attention to expediting neighbor searches for outlier candidates. In this study, we propose a new algorithm, CPOD, to improve the efficiency of outlier detections while reducing its memory requirements. CPOD uses a unique data structure called "core point" with multi-distance indexing to both quickly identify inliers and reduce neighbor search spaces for outlier candidates. We show that with six real-world and one synthetic dataset, CPOD is, on average, 10, 19, and 73 times faster than M_MCOD, NETS, and MCOD, respectively, while consuming low memory.
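The distance-based outlier definition underlying this line of work is simple to state: a point in the current window is an outlier if it has fewer than k neighbors within radius r. The quadratic scan below gives only the reference semantics that CPOD's core-point index accelerates; one-dimensional points are used for brevity.

```python
# Reference semantics of distance-based outliers in a sliding window:
# a point is an outlier if fewer than k other points lie within radius r.
# This brute-force O(n^2) scan is the baseline that indexing schemes like
# CPOD's core points are designed to avoid.

def outliers(window, r, k):
    out = []
    for i, p in enumerate(window):
        neighbors = sum(1 for j, q in enumerate(window)
                        if i != j and abs(p - q) <= r)
        if neighbors < k:
            out.append(p)
    return out

window = [1.0, 1.1, 0.9, 1.05, 9.0]
assert outliers(window, r=0.5, k=2) == [9.0]
```

In a streaming setting, every window slide adds and expires points, so the expensive part is re-checking neighbor counts; pruning candidates via precomputed distances to a few core points is what saves the repeated searches.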

SAND: Streaming Subsequence Anomaly Detection [Download Paper] (Université de Paris), (University of Chicago), (University of Paris), (University of Chicago) Subsequence anomaly detection in long sequences is an important problem with applications in a wide range of domains. With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In addition, subsequence anomaly detection methods usually require access to the entire dataset and are not able to learn and detect anomalies in streaming settings. To address these problems, we propose SAND, a novel online method suitable for domain-agnostic anomaly detection. The key idea of SAND is the detection of anomalies based on their (dis)similarity to a model that represents normal behavior. SAND relies on a novel streaming methodology to incrementally update such a model, which adapts to distribution drifts and omits obsolete data. The experimental results on several real datasets demonstrate that SAND correctly identifies single and recurrent anomalies of various types, without prior knowledge of the characteristics of these anomalies. In addition, SAND outperforms by a large margin the current state-of-the-art algorithms in terms of accuracy, while achieving orders of magnitude speedups.

13:30 – 14:30 CEST Research Session 29: Persistent Memory II Chaired by Alberto Lerner

Understanding the Idiosyncrasies of Real Persistent Memory [Download Paper] (The Ohio State University), (The Ohio State University), (The Ohio State University) High capacity persistent memory (PMEM) is finally commercially available in the form of Intel's Optane DC Persistent Memory Module (DCPMM). Researchers have raced to evaluate and understand the performance of DCPMM itself as well as systems and applications designed to leverage PMEM resulting from over a decade of research. Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are related to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, memristor, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM, i.e., how the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study which can guide the design of systems and is agnostic to PMEM technology and internal architecture.
In this paper, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.

Persistent Memory Hash Indexes: An Experimental Evaluation [Download Paper] (College of Computer Science and Electronic Engineering, Hunan University, China), (Hunan University), (Peking University Shenzhen Graduate School), (College of Computer Science and Electronic Engineering, Hunan University, China), (College of Computer Science and Electronic Engineering, Hunan University, China) Persistent memory (PM) is increasingly being leveraged to build hash-based indexing structures featuring cheap persistence, high performance, and instant recovery, especially with the recent release of Intel Optane DC Persistent Memory Modules. However, most of them are evaluated on DRAM-based emulators under unrealistic assumptions, or focus on the evaluation of specific metrics with important properties sidestepped. Thus, it is essential to understand how well the proposed hash indexes perform on real PM and how they differentiate from each other if a wider range of performance metrics are considered.
To this end, this paper provides a comprehensive evaluation of persistent hash tables. In particular, we focus on the evaluation of six state-of-the-art hash tables, including Level hashing, CCEH, Dash, PCLHT, Clevel, and SOFT, on real PM hardware. Our evaluation was conducted using a unified benchmarking framework and representative workloads. Besides characterizing common performance properties, we also explore how hardware configurations (such as PM bandwidth, CPU instructions, and NUMA) affect the performance of PM-based hash tables. With our in-depth analysis, we identify design trade-offs and good paradigms in prior art, and suggest desirable optimizations and directions for the future development of PM-based hash tables.

Viper: An Efficient Hybrid PMem-DRAM Key-Value Store [Download Paper] (Hasso Plattner Institute, University of Potsdam), (Hasso Plattner Institute, University of Potsdam), (HPI, University of Potsdam) Key-value stores (KVSs) have found wide application in modern software systems. For persistence, their data resides in slow secondary storage, which requires KVSs to employ various techniques to increase their read and write performance from and to the underlying medium. Emerging persistent memory (PMem) technologies offer data persistence at close-to-DRAM speed, making them a promising alternative to classical disk-based storage. However, simply using PMem as a drop-in replacement for existing storage does not yield good results, as block-based access behaves differently in PMem than on disk and ignores PMem's byte addressability, layout, and unique performance characteristics. In this paper, we propose three PMem-specific access patterns and implement them in a hybrid PMem-DRAM KVS called Viper. We employ a DRAM-based hash index and a PMem-aware storage layout to utilize the random-write speed of DRAM and the efficient sequential-write performance of PMem. Our evaluation shows that Viper significantly outperforms existing KVSs for core KVS operations while providing full data persistence. Moreover, Viper outperforms existing PMem-only, hybrid, and disk-based KVSs by 4--18x for write workloads, while matching or surpassing their get performance.
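The hybrid layout described above can be sketched in a few lines, with a Python dict standing in for the DRAM hash index and a bytearray standing in for the PMem log. The class and method names (`HybridKVS`, `put`, `get`) are illustrative, not Viper's actual API:

```python
# Minimal sketch of a hybrid PMem-DRAM KVS: a volatile hash index (dict, "DRAM")
# maps keys to offsets in an append-only persistent log (bytearray, "PMem").
import struct

class HybridKVS:
    def __init__(self):
        self.index = {}          # DRAM: key -> offset of latest record in the log
        self.log = bytearray()   # "PMem": sequentially written records

    def put(self, key: bytes, value: bytes):
        offset = len(self.log)
        # record layout: key length, value length, key bytes, value bytes
        self.log += struct.pack("<II", len(key), len(value)) + key + value
        self.index[key] = offset  # the random-access update stays in DRAM

    def get(self, key: bytes):
        offset = self.index.get(key)
        if offset is None:
            return None
        klen, vlen = struct.unpack_from("<II", self.log, offset)
        start = offset + 8 + klen
        return bytes(self.log[start:start + vlen])
```

Updates append a fresh record sequentially (the access pattern PMem favors), while the index update, a random write, stays in DRAM.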

13:30 – 14:30 CESTResearch Session 30: AutoML Chaired by Wolfgang Lehner

VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition [Download Paper] (Peking University), (Peking University), (Peking University), (ETH Zurich), (Alibaba Group), (Data Analytics and Intelligence Lab, Alibaba Group), (Alibaba Group), (Peking University), (Microsoft Research), (ETH), (Peking University) End-to-end AutoML, which automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning, has attracted intense interest from both academia and industry. Existing AutoML systems, however, suffer from scalability issues when applied to application domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to utilize these building blocks to compose an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model -- akin to the one supported by modern database systems -- to execute the constructed plan. Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies that are significantly more efficient than the ones employed by state-of-the-art AutoML systems such as auto-sklearn.

How to Design Robust Algorithms using Noisy Comparison Oracle [Download Paper] (University of Massachusetts Amherst), (University of Massachusetts Amherst), (University of California, Berkeley) Metric-based comparison operations such as finding the maximum, nearest neighbor, and farthest neighbor are fundamental to studying various clustering techniques such as $k$-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances, is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak supervision in the form of a comparison oracle that compares the relative distance between the queried points, such as 'Is point $u$ closer to $v$, or is $w$ closer to $x$?'. However, it is possible that some queries are easier to answer than others using a comparison oracle. We capture this by introducing two different noise models called adversarial and probabilistic noise. In this paper, we study various problems under these noise models, including finding the maximum and nearest/farthest neighbor search. Building upon the techniques we develop for these problems, we give robust algorithms for $k$-center clustering and agglomerative hierarchical clustering. We prove that our algorithms achieve good approximation guarantees with a high probability and analyze their query complexity. We evaluate the effectiveness and efficiency of our techniques empirically on various real-world datasets.
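To make the probabilistic noise model concrete, here is a toy sketch (not the paper's algorithm): an oracle that answers a distance comparison correctly with probability p, plus a repetition-and-majority-vote wrapper used to scan for a nearest neighbor. All names, the 1-D metric, and the parameter values are illustrative assumptions:

```python
# Toy probabilistic-noise comparison oracle on 1-D points, and a simple
# majority-vote scan for the nearest neighbor of a query point q.
import random

def noisy_closer(q, u, v, p=0.8, rng=random):
    """Answer 'is u closer to q than v?' correctly with probability p."""
    truth = abs(q - u) < abs(q - v)
    return truth if rng.random() < p else not truth

def majority_closer(q, u, v, trials=31, p=0.8, rng=random):
    """Boost reliability by repeating the noisy query and taking a majority."""
    votes = sum(noisy_closer(q, u, v, p, rng) for _ in range(trials))
    return votes > trials // 2

def nearest(q, points, trials=31, p=0.8, rng=random):
    """Linear scan keeping the current best candidate under majority votes."""
    best = points[0]
    for cand in points[1:]:
        if majority_closer(q, cand, best, trials, p, rng):
            best = cand
    return best
```

With enough repetitions per comparison, the probability that any single majority vote errs becomes negligible, which is the basic intuition the paper's query-complexity analysis makes precise.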

Doing More with Less: Characterizing Dataset Downsampling for AutoML [Download Paper] (ETH Zurich), (MIT), (MIT), (TU Wien and MIT) Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric, subject to a limited execution time budget. While this approach to generating candidates works well for small datasets, the same procedure does not directly scale to larger datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. We carry out an extensive empirical evaluation of the impact that downsampling -- reducing the number of rows in the input tabular dataset -- has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.

13:30 – 14:30 CESTResearch Session 31: Implementing DBMS Chaired by Ippokratis Pandis

Scaling Replicated State Machines with Compartmentalization [Download Paper] (UC Berkeley), (Microsoft), (University of New Hampshire), (University at Buffalo, SUNY), (UC Berkeley), (UC Berkeley), (University of Cambridge), (UC Berkeley), (VMware) State machine replication protocols, like MultiPaxos and Raft, are a critical component of many distributed systems and databases. However, these protocols offer relatively low throughput due to several bottlenecked components. Numerous existing protocols fix different bottlenecks in isolation but fall short of a complete solution. When you fix one bottleneck, another arises. In this paper, we introduce compartmentalization, the first comprehensive technique to eliminate state machine replication bottlenecks. Compartmentalization involves decoupling individual bottlenecks into distinct components and scaling these components independently. Compartmentalization has two key strengths. First, compartmentalization leads to strong performance. In this paper, we demonstrate how to compartmentalize MultiPaxos to increase its throughput by 6x on a write-only workload and 16x on a mixed read-write workload. Unlike other approaches, we achieve this performance without the need for specialized hardware. Second, compartmentalization is a technique, not a protocol. Industry practitioners can apply compartmentalization to their protocols incrementally without having to adopt a completely new protocol.

Hindsight Logging for Model Training [Download Paper] (UC Berkeley), (UC Berkeley), (UC Berkeley), (UC Berkeley), (UC Berkeley), (UC Berkeley), (UC Berkeley), (University of California, Berkeley) In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from a checkpoint---a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay.
Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g. 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
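The checkpoint-then-replay idea behind hindsight logging can be illustrated with a minimal hand-written sketch. flor automates this instrumentation; the training loop, checkpoint schedule, and record format below are invented purely for illustration:

```python
# Periodic checkpointing during "training", then post-hoc replay from the
# nearest checkpoint to recover a value that was never logged the first time.
import pickle

def train(steps, ckpt_every=10):
    checkpoints = {}
    w = 0.0
    for step in range(steps):
        w += 0.1 * (step % 3)              # stand-in for a gradient update
        if step % ckpt_every == 0:
            checkpoints[step] = pickle.dumps({"w": w})
    return w, checkpoints

def replay(checkpoints, target_step, log):
    # restore the latest checkpoint at or before target_step, then re-execute
    # only the remaining iterations, emitting the post-hoc log statements
    start = max(s for s in checkpoints if s <= target_step)
    w = pickle.loads(checkpoints[start])["w"]
    for step in range(start + 1, target_step + 1):
        w += 0.1 * (step % 3)
        log.append((step, w))              # the "hindsight" log statement
    return w
```

Because replay resumes from a checkpoint rather than step 0, only a fraction of the run is re-executed, which is why hindsight log replay can be orders of magnitude faster than restarting training from scratch.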

In-Network Support for Transaction Triaging [Download Paper] (Stanford University), (University of Fribourg), (Università della Svizzera italiana), (Yale University), (Exascale Infolab, Fribourg University) We introduce Transaction Triaging, a set of techniques that manipulate streams of transaction requests and responses while they travel to and from a database server. Compared to normal transaction streams, the triaged ones execute faster once they reach the database. The triaging algorithms do not interfere with the transaction execution nor require adherence to any particular concurrency control method, making them easy to port across database systems.
Transaction Triaging leverages recent programmable networking hardware that can perform computations on in-flight data. We evaluate our techniques on an in-memory database system using an actual programmable hardware network switch. Our experimental results show that triaging brings enough performance gains to compensate for almost all networking overheads. In high-overhead network stacks such as UDP/IP, we see throughput improvements from 2.05× to 7.95×. In an RDMA stack, the gains range from 1.08× to 1.90× without introducing significant latency.

13:30 – 14:30 CESTResearch Session 32: Provenance Chaired by Martin Hentschel

Compact, Tamper-Resistant Archival of Fine-Grained Provenance [Download Paper] (University of Pennsylvania), (University of Pennsylvania) Data provenance tools aim to facilitate reproducible data science and auditable data analyses, by tracking the processes and inputs responsible for each result of an analysis. Fine-grained provenance further enables sophisticated reasoning about why individual output results appear or fail to appear — aiding debugging and diagnosis. However, for provenance to be truly useful for reproducibility and auditing, we need a provenance archival system that ensures it is tamper-resistant, and that storing provenance collected over many queries and across time is efficient (i.e., it compresses repeated results). In this paper we study this problem, developing solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using a variety of workloads based on scientific and OLAP workloads.

Capturing and querying fine-grained provenance of preprocessing pipelines in data science [Download Paper] (University of Southampton), (Newcastle University), (Universita Roma Tre), (Universita Roma Tre) Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers' debugging questions, as expressed on the Data Science Stack Exchange.
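A minimal sketch of element-level provenance capture for one common preprocessing operator, mean imputation: alongside the transformed table, the operator emits provenance records naming the cells it created and the cells it derived them from. The record schema here is invented for illustration; the paper defines formal provenance patterns per operator:

```python
# Mean imputation over a list-of-dicts table, recording per-cell provenance.
def impute_mean(rows, col):
    present = [r[col] for r in rows if r[col] is not None]
    mean = sum(present) / len(present)
    provenance = []
    out = []
    for i, r in enumerate(rows):
        r = dict(r)                        # leave the input table untouched
        if r[col] is None:
            r[col] = mean
            provenance.append({"op": "impute_mean", "row": i, "col": col,
                               "used": [j for j, s in enumerate(rows)
                                        if s[col] is not None]})
        out.append(r)
    return out, provenance
```

Querying such records answers questions like "which input cells influenced this training-set value?", the kind of debugging question the paper's benchmark queries target.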

Fine-Grained Lineage for Safer Notebook Interactions [Download Paper] (University of Illinois at Urbana-Champaign), (University of California, Berkeley), (University of California, Berkeley), (UC Berkeley), (UC Berkeley), (University of California, Berkeley) Computational notebooks have emerged as the platform of choice for data science and analytical workflows, enabling rapid iteration and exploration. By keeping intermediate program state in memory and segmenting units of execution into so-called "cells", notebooks allow users to execute their workflows interactively and enjoy particularly tight feedback. However, as cells are added, removed, reordered, and rerun, this hidden intermediate state accumulates in a way that is not necessarily correlated with the code visible in the notebook's cells, making execution behavior difficult to reason about, and leading to errors and lack of reproducibility. We present NBSafety, a custom Jupyter kernel that uses runtime tracing and static analysis to automatically manage lineage associated with cell execution and global notebook state. NBSafety detects and prevents errors that users make during unaided notebook interactions, all while preserving the flexibility of existing notebook semantics. We evaluate NBSafety's ability to prevent erroneous interactions by replaying and analyzing 666 real notebook sessions. Of these, NBSafety identified 117 sessions with potential safety errors, and in the remaining 549 sessions, the cells that NBSafety identified as resolving safety issues were more than 7x more likely to be selected by users for re-execution compared to a random baseline, even though the users were not using NBSafety and were therefore not influenced by its suggestions.

13:30 – 14:30 CESTResearch Session 33: Database Applications Chaired by Jorge Anulfo Quiane Ruiz

On the String Matching with k Differences in DNA Databases [Download Paper] (University of Winnipeg), (University of Winnipeg) In this paper, we discuss an efficient and effective index mechanism for the string matching with $k$ differences, by which we will find all the substrings of a target string $y$ of length $n$ that align with a pattern string $x$ of length $m$ with not more than $k$ insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, $n$ is often on the order of millions or billions while $m$ is just a hundred or a thousand. The main idea of our method is to transform $y$ to a BWT-array as an index, denoted as $BWT(y)$, and search $x$ against it. The time complexity of our method is bounded by $O(k\cdot|T|)$, where $T$ is a tree structure dynamically generated during a search of $BWT(y)$. The average value of $|T|$ is bounded by $O(|\Sigma|^{2k})$, where $\Sigma$ is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies for the cases of $k \leq O(\log_{|\Sigma|} n)$. The general working process consists of two steps. In the first step, $x$ is decomposed into a series of $l$ small subpatterns, and $BWT(y)$ is utilized to speed up the process of finding all the occurrences of such subpatterns with $\lfloor k/l \rfloor$ differences. In the second step, all the occurrences found in the first step are rechecked to see whether they really match $x$ within $k$ differences. Extensive experiments have been conducted, which show that our method for this problem is promising.
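The recheck in the second step can be illustrated with a standard dynamic-programming test that a candidate substring aligns with the pattern within $k$ differences (insertions, deletions, and mismatches). The BWT-indexed first step is omitted; this is only the verification kernel:

```python
# Edit-distance check: does pattern x align with candidate y within k
# insertions, deletions, and mismatches?
def within_k_differences(x: str, y: str, k: int) -> bool:
    m, n = len(x), len(y)
    if abs(m - n) > k:          # length gap alone already exceeds the budget
        return False
    prev = list(range(n + 1))   # row 0 of the DP table
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (x[i - 1] != y[j - 1])) # (mis)match
        prev = cur
    return prev[n] <= k
```

In the real method this check runs only on the few candidate positions the BWT index reports, which is what keeps the overall search fast.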

A four-dimensional Analysis of Partitioned Approximate Filters [Download Paper] (TUM), (TUM), (TU Munich) With today’s data deluge, approximate filters are particularly attractive to avoid expensive operations like remote data/disk accesses. Among the many filter variants available, it is non-trivial to find the most suitable one and its optimal configuration for a specific use-case. We provide open-source implementations for the most relevant filters (Bloom, Cuckoo, Morton, and Xor filters) and compare them in four key dimensions: the false-positive rate, space consumption, build, and lookup throughput.
We improve upon existing state-of-the-art implementations with a new optimization, radix-partitioning, boosting the build and lookup throughput for large filters by up to 9x and 5x. Our in-depth evaluation first studies the impact of all available optimizations separately before combining them to search for the optimal filter for specific use-cases. While register-blocked Bloom filters offer the highest throughput, the new Xor filters are best suited when optimizing for small filter sizes or low false-positive rates.
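For readers unfamiliar with the baseline structure, here is a plain (unblocked, unpartitioned) Bloom filter sketch; the register-blocked and radix-partitioned variants studied above restructure this same idea for cache and SIMD efficiency. The hashing scheme (double hashing from SHA-256 halves) is one common choice, not the paper's implementation:

```python
# A minimal Bloom filter: k bit positions per item, derived by double hashing.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        h = hashlib.sha256(item).digest()
        h1 = int.from_bytes(h[:8], "little")
        h2 = int.from_bytes(h[8:16], "little") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # no false negatives; false positives occur when all k bits collide
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

The four evaluation dimensions map directly onto this sketch: `m_bits` sets space consumption, `k_hashes` trades false-positive rate against build and lookup throughput.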

Unconstrained Submodular Maximization with Modular Costs: Tight Approximation and Application to Profit Maximization [Download Paper] (National University of Singapore), (City University of Hong Kong), (National University of Singapore), (The Hong Kong Polytechnic University), (Nanyang Technological University), (National University of Singapore) Given a set V, the problem of unconstrained submodular maximization with modular costs (USM-MC) asks for a subset S \subseteq V that maximizes f(S) - c(S), where f is a non-negative, monotone, and submodular function that gauges the utility of S, and c is a non-negative and modular function that measures the cost of S. This problem finds applications in numerous practical scenarios, such as profit maximization in viral marketing on social networks.
This paper presents ROI-Greedy, a polynomial time algorithm for USM-MC that returns a solution S satisfying f(S) - c(S) >= f(S*) - c(S*) - c(S*)\ln(f(S*)/c(S*)), where S* is the optimal solution to USM-MC. To our knowledge, ROI-Greedy is the first algorithm that provides such a strong approximation guarantee. In addition, we show that this worst-case guarantee is tight, in the sense that no polynomial time algorithm can ensure f(S) - c(S) >= (1+\epsilon)(f(S*) - c(S*) - c(S*)\ln(f(S*)/c(S*))) for any \epsilon > 0. Further, we devise a non-trivial extension of ROI-Greedy to solve the profit maximization problem, where the precise value of f(S) for any set S is unknown and can only be approximated via sampling. Extensive experiments on benchmark datasets demonstrate that ROI-Greedy significantly outperforms competing methods in terms of the trade-off between efficiency and solution quality.
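The flavor of a return-on-investment greedy can be conveyed by the following simplified sketch: repeatedly add the element with the highest marginal-utility-to-cost ratio, stopping once no element's marginal gain exceeds its cost. This is an illustration only, not the paper's ROI-Greedy with its tight guarantee:

```python
# Simplified ROI-style greedy for maximizing f(S) - c(S), with f submodular
# (passed as a set function) and c modular (passed per element).
def roi_greedy(V, f, c):
    S = set()
    while True:
        best, best_ratio = None, 1.0     # only add while gain/cost > 1
        for v in V - S:
            gain = f(S | {v}) - f(S)
            if c(v) > 0 and gain / c(v) > best_ratio:
                best, best_ratio = v, gain / c(v)
        if best is None:
            return S
        S.add(best)
```

For instance, with a coverage utility (size of the union of per-element sets) and modular costs, the greedy keeps only elements whose marginal coverage is worth their price.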

13:30 – 14:30 CESTResearch Session 34: Fairness & Explainability Chaired by Nargesian Fatemeh

Tailoring Data Source Distributions for Fairness-aware Data Integration [Download Paper] (University of Rochester), (University of Illinois at Chicago), (University of Michigan) Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: in other words, that it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required.
In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal cost. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.

Automated Feature Engineering for Algorithmic Fairness [Download Paper] (TU Berlin), (TU Berlin), (Leibniz Universität Hannover) One of the fundamental problems of machine ethics is to avoid the perpetuation and amplification of discrimination through machine learning applications. In particular, it is desired to exclude the influence of attributes with sensitive information, such as gender or race, and other causally related attributes on the machine learning task. The state-of-the-art bias reduction algorithm Capuchin breaks the causality chain of such attributes by adding and removing tuples. However, this horizontal approach can be considered invasive because it changes the data distribution. A vertical approach would be to prune sensitive features entirely. While this would ensure fairness without tampering with the data, it could also hurt the machine learning accuracy. Therefore, we propose a novel multi-objective feature selection strategy that leverages feature construction to generate more features that lead to both high accuracy and fairness. On three well-known datasets, our system achieves higher accuracy than other fairness-aware approaches while maintaining similar or higher fairness.

Explaining Monotonic Ranking Functions [Download Paper] (Rutgers University), (Rutgers) Ranking functions are commonly used to assist in decision-making in a wide variety of applications. As the general public realizes the significant societal impacts of the widespread use of algorithms in decisions, there has been a push towards explainability and transparency in decision processes and results, as well as demands to justify the fairness of the processes. In this paper, we focus on providing metrics towards explainability and transparency of ranking functions, aiming to make the ranking process understandable, {\em a priori}, so that decision-makers can make informed choices when designing their ranking selection process. We propose transparent participation metrics to clarify the ranking process, by assessing the contribution of each parameter used in the ranking function in the creation of the final ranked outcome, using information about the ranking functions themselves, as well as observations of the underlying distributions of the parameter values involved in the ranking. To evaluate the outcome of the ranking process, we propose diversity and disparity metrics to measure how similar the selected objects are to each other, and to the underlying data distribution. We evaluate the behavior of our metrics on synthetic data, as well as on data and ranking functions on two real-world scenarios: high school admissions and decathlon scoring.

16:15 – 17:45 CESTResearch Session 35: Differential Privacy I Chaired by Wolfgang Lehner

Optimizing Fitness-For-Use of Differentially Private Linear Queries [Download Paper] (Pennsylvania State University), (Penn State), (Penn State), (Penn State), (Penn State) In practice, differentially private data releases are designed to support a variety of applications. A data release is fit for use if it meets target accuracy requirements for each application. In this paper, we consider the problem of answering linear queries under differential privacy subject to per-query accuracy constraints. Existing practical frameworks like the matrix mechanism do not provide such fine-grained control (they optimize total error, which allows some query answers to be more accurate than necessary, at the expense of other queries that become no longer useful). Thus, we design a fitness-for-use strategy that adds privacy-preserving Gaussian noise to query answers. The covariance structure of the noise is optimized to meet the fine-grained accuracy requirements while minimizing the cost to privacy.
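As a deliberately simplified illustration of per-query accuracy targets, the sketch below adds independent Gaussian noise whose standard deviation is set per query from that query's own accuracy requirement. The paper instead optimizes a full noise covariance jointly against the privacy cost; this diagonal-only version ignores correlations between queries, and the function name is illustrative:

```python
# Per-query Gaussian noise: each linear query answer receives noise whose
# standard deviation comes from that query's individual accuracy target.
import random

def answer_with_targets(true_answers, target_sds, rng=random):
    assert len(true_answers) == len(target_sds)
    return [a + rng.gauss(0.0, sd) for a, sd in zip(true_answers, target_sds)]
```

The point of the fitness-for-use framing is that tightening one query's `target_sds` entry costs privacy budget only where accuracy is actually needed, instead of minimizing total error across all queries.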

Data Synthesis via Differentially Private Markov Random Field [Download Paper] (National University of Singapore), (University of Connecticut), (National Univ. of Singapore), (National University of Singapore) This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D, and then using them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limit the choices of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality of the resulting synthetic data. To address the above deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choices of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of count queries and classification tasks conducted on the synthetic data generated.

Differentially Private Binary- and Matrix-Valued Data Query: An XOR Mechanism [Download Paper] (Case Western Reserve University), (Case Western Reserve University), (University of Houston-Downtown), (Case Western Reserve University, Bilkent University), (Case Western Reserve University), (The University of Tennessee, Knoxville) Differential privacy has been widely adopted to release continuous- and scalar-valued information on a database without compromising the privacy of individual data records in it. The problem of querying binary- and matrix-valued information on a database in a differentially private manner has rarely been studied. However, binary- and matrix-valued data are ubiquitous in real-world applications, whose privacy concerns may arise under a variety of circumstances. In this paper, we devise an exclusive or (XOR) mechanism that perturbs binary- and matrix-valued query results by conducting an XOR operation on the query result with calibrated noise drawn from a matrix-valued Bernoulli distribution. We first rigorously analyze the privacy and utility guarantee of the proposed XOR mechanism. Then, to generate the parameters of the matrix-valued Bernoulli distribution, we develop a heuristic approach to minimize the expected square query error rate under the 𝜖-differential privacy constraint. Additionally, to address the intractability of calculating the probability density function (PDF) of this distribution and efficiently generate samples from it, we adapt an Exact Hamiltonian Monte Carlo based sampling scheme. Finally, we experimentally demonstrate the efficacy of the XOR mechanism by considering binary data classification and social network analysis, all in a differentially private manner. Experiment results show that the XOR mechanism notably outperforms other state-of-the-art differentially private methods in terms of utility (such as classification accuracy and 𝐹1 score), and even achieves comparable utility to the non-private mechanisms.
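A deliberately simplified, i.i.d. version of the XOR idea: perturb a binary vector by XOR-ing it with Bernoulli noise that flips each bit with probability 1/(1 + e^ε). Per bit this is classical randomized response satisfying ε-DP; the paper's mechanism instead draws structured matrix-valued Bernoulli noise, so treat this only as a sketch of the perturbation step:

```python
# XOR a binary vector with i.i.d. Bernoulli noise; flip probability
# 1/(1 + e^eps) gives per-bit eps-differential privacy (randomized response).
import math, random

def xor_perturb(bits, eps, rng=random):
    p_flip = 1.0 / (1.0 + math.exp(eps))
    noise = [1 if rng.random() < p_flip else 0 for _ in bits]
    return [b ^ z for b, z in zip(bits, noise)]
```

Smaller ε raises the flip probability toward 1/2 (more privacy, less utility); large ε leaves the vector essentially unchanged.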

Answering Multi-Dimensional Range Queries under Local Differential Privacy [Download Paper] (Beijing University of Posts and Telecommunications), (Purdue University), (Purdue University), (Beijing University of Posts and Telecommunications), (Beijing University of Posts and Telecommunications) In this paper, we tackle the problem of answering multi-dimensional range queries under local differential privacy. There are three key technical challenges: capturing the correlations among attributes, avoiding the curse of dimensionality, and dealing with the large domains of attributes. None of the existing approaches satisfactorily deals with all three challenges. Overcoming these three challenges, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the two-dimensional (2-D) domains of all attribute pairs into 2-D grids that can answer all 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries. However, in order to reduce errors due to noises, coarse granularities are needed for each attribute in 2-D grids, losing fine-grained distribution information for individual attributes. To correct this deficiency, we further propose Hybrid-Dimensional Grids (HDG), which also introduces 1-D grids to capture finer-grained information on distribution of each individual attribute and combines information from 1-D and 2-D grids to answer range queries. To make HDG consistently effective, we provide a guideline for properly choosing granularities of grids based on an analysis of how different sources of errors are impacted by these choices. Extensive experiments conducted on real and synthetic datasets show that HDG can give a significant improvement over the existing approaches.

CGM: An Enhanced Mechanism for Streaming Data Collection with Local Differential Privacy [Download Paper] (National University of Singapore), (Hamad bin Khalifa University), (National University of Singapore), (Data Analytics and Intelligence Lab, Alibaba Group) Local differential privacy (LDP) is a well-established privacy protection scheme for collecting sensitive data, which has been integrated into major platforms such as iOS, Chrome, and Windows. The main idea is that each individual randomly perturbs her data on her local device, and only uploads the noisy version to an untrusted data aggregator. This paper focuses on the collection of streaming data consisting of regular updates, e.g., daily app usage. Such streams, when aggregated over a large population, often exhibit strong autocorrelations, e.g., the average usage of an app usually does not change dramatically from one day to the next. To our knowledge, this property has been largely neglected in existing LDP mechanisms. Consequently, data collected with current LDP methods often exhibit unrealistically violent fluctuations due to the added noise, drowning the overall trend, as shown in our experiments.
This paper proposes a novel correlated Gaussian mechanism (CGM) for enforcing (epsilon, delta)-LDP on streaming data collection, which reduces noise by exploiting publicly known autocorrelation patterns of the aggregated data. This is done through non-trivial modifications to the core of the underlying Gaussian Mechanism; in particular, CGM injects temporally correlated noise, computed through an optimization program that takes into account the given autocorrelation pattern, data value range, and utility metric. CGM comes with a formal proof of correctness, and consumes negligible computational resources. Extensive experiments using real datasets from different application domains demonstrate that CGM achieves consistent and significant utility gains compared to the baseline method of repeatedly running the underlying one-shot LDP mechanism.
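As a simplified illustration of temporally correlated noise (CGM derives its noise from an optimization program; the AR(1) process below is only a stand-in for the general idea that consecutive perturbations should change smoothly):

```python
import random

def correlated_gaussian_noise(n, sigma, rho, seed=None):
    """Generate n Gaussian noise terms where consecutive terms have
    correlation rho: an AR(1) process with stationary std sigma.
    This only illustrates temporally correlated noise; the actual CGM
    noise is computed via an optimization program over the given
    autocorrelation pattern, value range, and utility metric."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, sigma)]
    innov_sd = sigma * (1.0 - rho * rho) ** 0.5  # keeps variance stationary
    for _ in range(n - 1):
        noise.append(rho * noise[-1] + rng.gauss(0.0, innov_sd))
    return noise
```

With rho close to 1, the injected noise drifts slowly instead of fluctuating wildly, so a slowly changing aggregate trend is not drowned out.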

16:15 – 17:45 CEST Research Session 36: Graph Management II Chaired by Riccardo Tommasini

Minimum Vertex Augmentation [Download Paper] (Chinese University of Hong Kong), (The Chinese University of Hong Kong) This paper introduces a class of graph problems named minimum vertex augmentation (MVA). Given an input graph $G$ where each vertex carries a binary color 0 or 1, we want to flip the colors of the fewest 0-vertices such that the subgraph induced by all the (original and new) 1-vertices satisfies a user-defined predicate $\pi$. In other words, the goal is to minimally augment the subset of 1-vertices to uphold the property $\pi$. Different formulations of $\pi$ instantiate the framework into concrete problems at the core of numerous applications. We first describe a suite of techniques for solving MVA problems with strong performance guarantees, and then present a generic algorithmic paradigm that a user can instantiate to deal with ad-hoc MVA problems. The effectiveness and efficiency of our solutions are verified with an extensive experimental evaluation.

Automating Incremental Graph Processing with Flexible Memoization [Download Paper] (Northeastern University), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Northeastern University), (Alibaba Group), (Northeastern University), (Northeastern University), (Alibaba Group) The ever-growing amount of dynamic graph data demands efficient techniques for incremental graph processing. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or choose different memoization strategies for certain specific types of computation, limiting the usability and generality.
In light of these challenges, we propose Ingress, an automated system for incremental graph processing. Ingress is able to incrementalize batch vertex-centric algorithms into their incremental counterparts as a whole, without requiring users to redesign logic or data structures. Underlying Ingress is an automated incrementalization framework equipped with four different memoization policies, aiming to support all kinds of graph computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies. Ingress chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to its ease of use and generality, Ingress outperforms state-of-the-art incremental graph systems by 15.93× on average (up to 147.14×) in efficiency.

RapidMatch: A Holistic Approach to Subgraph Query Processing [Download Paper] (National University of Singapore), (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (National University of Singapore) A subgraph query searches for all embeddings in a data graph that are identical to a query graph. Two kinds of algorithms, either graph exploration based or join based, have been developed for processing subgraph queries. Due to algorithmic and implementational differences, join-based systems can handle query graphs of a few vertices efficiently whereas exploration-based approaches typically process up to several tens of vertices in the query graph. In this paper, we first compare these two kinds of methods and prove that the complexity of result enumeration in state-of-the-art exploration-based methods matches that of the worst-case optimal join. Furthermore, we propose RapidMatch, a holistic subgraph query processing framework integrating the two approaches. Specifically, RapidMatch not only runs relational operators such as selections and joins, but also utilizes graph structural information, as in graph exploration, for filtering and join plan generation. Consequently, it outperforms the state of the art in both approaches on a wide range of query workloads.

Teseo and the Analysis of Structural Dynamic Graphs [Download Paper] (Centrum Wiskunde & Informatica), (CWI) We present Teseo, a new system for the storage and analysis of dynamic structural graphs in main memory, with transactional support. Teseo introduces a novel design based on sparse arrays, large arrays interleaved with gaps, and a fat tree, where the graph is ultimately stored. Our design contrasts with earlier systems for the analysis of dynamic graphs, which often lack transactional support and are anchored to a vertex table as a primary index. We claim that the vertex table implies several constraints, often neglected, that can actually impair the generality, robustness, and extension opportunities of these systems. We compare Teseo with other dynamic graph systems, showing a high resilience to workload and input changes, while achieving comparable, if not superior, throughput in updates and latency in raw scans.

ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms [Download Paper] (MIT CSAIL), (Massachusetts Institute of Technology), (MIT) Connected components is a fundamental kernel in graph applications. The fastest existing multicore algorithms for graph connectivity are based on some form of edge sampling and/or linking and compressing trees. However, many combinations of these design choices have been left unexplored. In this paper, we design the ConnectIt framework, which provides different sampling strategies as well as various tree linking and compression schemes. ConnectIt enables us to obtain several hundred new variants of connectivity algorithms, most of which extend to computing spanning forest. In addition to static graphs, we also extend ConnectIt to support mixes of insertions and connectivity queries in the concurrent setting.
We present an experimental evaluation of ConnectIt on a 72-core machine, which we believe is the most comprehensive evaluation of parallel connectivity algorithms to date. Compared to a collection of state-of-the-art static multicore algorithms, we obtain an average speedup of 12.4x (2.36x average speedup over the fastest existing implementation for each graph). Using ConnectIt, we are able to compute connectivity on the largest publicly-available graph (with over 3.5 billion vertices and 128 billion edges) in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the fastest existing connectivity result for this graph, in any computational setting. For our incremental algorithms, we show that our algorithms can ingest graph updates at up to several billion edges per second. To guide the user in selecting the best variants in ConnectIt for different situations, we provide a detailed analysis of the different strategies. Finally, we show how the techniques in ConnectIt can be used to speed up two important graph applications: approximate minimum spanning forest and SCAN clustering.
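The linking-and-compression core that ConnectIt varies can be sketched with a minimal sequential union-find (hypothetical code; ConnectIt's actual variants are concurrent and add edge-sampling strategies on top):

```python
class UnionFind:
    """Minimal union-find with path compression, illustrating the kind of
    tree linking/compression primitives that ConnectIt composes into
    hundreds of connectivity variants. Here: link-by-smaller-root and
    full path compression; ConnectIt explores many other combinations."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:        # path compression pass
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):                   # ingest one edge update
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)

    def connected(self, a, b):               # connectivity query
        return self.find(a) == self.find(b)
```

In the incremental setting of the abstract, each edge insertion is a `union` and each connectivity query is a `connected` call; the design space comes from how linking and compression are interleaved under concurrency.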

19Aug

09:00 – 10:30 CEST Research Session 37: Timeseries Chaired by Steffen Zeuch

Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series [Download Paper] (Ecole Polytechnique), (Ecole Polytechnique), (Ecole Polytechnique), (Ecole Polytechnique), (Ecole Polytechnique), (Intel Labs and MIT) Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining lots of attention, the lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as those for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon’s dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.

ORBITS: Online Recovery of Missing Values in Multiple Time Series Streams [Download Paper] (University of Fribourg), (University of Fribourg), (University of Fribourg), (Exascale Infolab, Fribourg University) With the emergence of the Internet of Things (IoT), time series streams have become ubiquitous in our daily life. Recording such data is rarely a perfect process, as sensor failures frequently occur, yielding occasional blocks of data that go missing in multiple time series. These missing blocks not only affect real-time monitoring but also compromise the quality of online data analyses. Effective streaming recovery (imputation) techniques either have a quadratic runtime complexity, which is infeasible for any moderately sized data, or cannot recover more than one time series at a time. In this paper, we introduce a new online recovery technique to recover multiple time series streams in linear time. Our recovery technique implements a novel incremental version of the Centroid Decomposition technique and reduces its complexity from quadratic to linear. Using this incremental technique, missing blocks are efficiently recovered in a continuous manner based on previous recoveries. We formally prove the correctness of our new incremental computation, which yields an accurate recovery. Our experimental results on real-world time series show that our recovery technique is, on average, 30% more accurate than the state of the art while being vastly more efficient.

Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries [Download Paper] (The Chinese University of HK), (The Chinese University of HK), (The Chinese University of Hong Kong) Performance-monitoring timeseries systems such as Prometheus and InfluxDB play a critical role in assuring reliability and operational stability. These systems commonly adopt a column-oriented storage model, by which timeseries samples from different timeseries are separated, and all samples (with both numeric values and timestamps) in one timeseries are grouped into chunks and stored together. As a group of timeseries are often collected from the same source with the same timestamps, managing timestamps and metrics in a group manner provides more opportunities for query and insertion optimization but poses new challenges as well. Besides, for performance monitoring systems, to support better compression and efficient queries for most recent data that are most likely accessed by users, huge volumes of data are first cached in memory and then periodically flushed to disks. Periodical data flushing incurs high IO overhead, and simply discarding flushed data, which can still serve queries, not only is a waste but also brings huge memory reclamation cost. In this paper, we propose Heracles which integrates two techniques - (1) a new storage model, which enables efficient queries on compressed data by utilizing the shared timestamp column to easily locate corresponding metric values; (2) a novel two-level epoch-based memory manager, which allows the system to gradually flush and reclaim in-memory data while unreclaimed data can still serve queries. Heracles is implemented as a standalone module that can be easily integrated into existing performance monitoring timeseries systems.
We have implemented a fully functional prototype with Heracles based on Prometheus tsdb, a representative open-source performance monitoring system, and conducted extensive experiments with real and synthetic timeseries data. Experimental results show that, compared with Prometheus, Heracles can improve the insertion throughput by 171%, and reduce the query latency and space usage by 32% and 30%, respectively, on average. Besides, to compare with other state-of-the-art storage techniques, we have integrated LevelDB (for LSM-tree-based structure) and Parquet (for column stores) into Prometheus tsdb, respectively, and experimental results show that Heracles outperforms both integrations. We have released the open-source code of Heracles for public access.
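The shared-timestamp layout can be sketched as follows (an illustration of the storage-model idea only, not Heracles' actual on-disk format): one sorted timestamp column is shared by all metrics collected from the same source, so a time-range query locates the row span once and slices every metric column directly.

```python
import bisect

class TimeseriesGroup:
    """Sketch of a shared-timestamp storage model: metrics scraped from the
    same source share one sorted timestamp column, so locating a time range
    is done once per group rather than once per timeseries."""

    def __init__(self, timestamps, metrics):
        self.ts = timestamps      # sorted, shared timestamp column
        self.metrics = metrics    # metric name -> values, aligned with ts

    def query(self, name, t_lo, t_hi):
        """Return (timestamp, value) pairs for `name` in [t_lo, t_hi]."""
        lo = bisect.bisect_left(self.ts, t_lo)
        hi = bisect.bisect_right(self.ts, t_hi)
        return list(zip(self.ts[lo:hi], self.metrics[name][lo:hi]))
```

Because the timestamp column is shared, the binary search cost is amortized across all metrics of the group, which is one of the optimization opportunities the abstract refers to.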

FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data [Download Paper] (Alibaba), (Data Analytics and Intelligence Lab, Alibaba Group), (Alibaba), (Alibaba Group), (Renmin University of China), (Alibaba Group), (Alibaba Group) Interactive response time is important in analytical pipelines for users to explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the part of data to be focused on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. While there are a number of forecasting models available, the first step is the performance bottleneck. A natural idea is to utilize sampling to obtain approximate aggregations in real time as the input to train the forecasting model. Our scalable real-time forecasting system FlashP (Flash Prediction) is built based on this idea, with two major challenges to be resolved in this paper: first, we need to figure out how approximate aggregations affect the fitting of forecasting models, and forecasting results; and second, accordingly, what sampling algorithms we should use to obtain these approximate aggregations and how large the samples should be. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations using GSW samples. We also show how to construct compact GSW samples in the presence of multiple measures to be analyzed. We conduct experiments to evaluate our solution and compare it with alternatives on real data.

09:00 – 10:30 CEST Research Session 38: Deep Learning II Chaired by Hazar Harmouch

DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks [Download Paper] (Bilkent University), (Bilkent University), (Bilkent University) Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed on relational databases is a challenging task, which has been widely studied in the database community recently. Conventional rule-based systems utilize a series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step, overlooking the earlier steps by using ad-hoc solutions. In the pipeline, one of the most critical and challenging problems is keyword mapping: constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define the keyword mapping problem as a sequence tagging problem, and propose a novel deep learning based supervised approach that utilizes POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets, and report new state-of-the-art accuracy results, 92.4% on average. Our results also indicate that DBTagger is faster than its counterparts by up to 10000 times and scalable for bigger databases.

Analyzing and Mitigating Data Stalls in DNN Training [Download Paper] (UT Austin), (Microsoft Research), (Microsoft), (UT Austin and VMWare) Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs), which typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation, etc., on servers that are a part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configs show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5x on a single server).

Progressive Compressed Records: Taking a Byte out of Deep Learning Data [Download Paper] (Carnegie Mellon University), (Carnegie Mellon University), (Carnegie Mellon University) Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserve bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous storage formats by combining progressive compression with an efficient storage layout to view a single dataset at multiple fidelities---all without adding to the total dataset size. We implement PCRs and evaluate them on a range of datasets, training tasks, and hardware architectures. Our work shows that: (i) the amount of compression a dataset can tolerate exceeds 50\% of the original encoding for many training tasks; (ii) it is possible to automatically and efficiently select appropriate compression levels for a given task; and (iii) PCRs enable tasks to readily access compressed data at runtime---utilizing as little as half the training bandwidth and thus potentially doubling training speed.
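The multiple-fidelity idea can be imitated with a toy bit-plane encoding (hypothetical code: PCRs actually build on progressive compression codecs and a storage layout for training records; this sketch only shows how reading a prefix of the encoding yields a coarser view of every value without storing the dataset twice):

```python
def encode_progressive(values, bits=8):
    """Store unsigned `bits`-bit values as bit planes, most significant
    plane first. Reading only a prefix of the planes yields a coarser
    approximation of every value, mirroring how PCRs expose one dataset
    at multiple fidelities without adding to the total dataset size."""
    return [[(v >> (bits - 1 - p)) & 1 for v in values] for p in range(bits)]

def decode_progressive(planes, bits=8):
    """Reconstruct values from however many leading planes are available."""
    out = [0] * len(planes[0])
    for p, plane in enumerate(planes):
        for i, bit in enumerate(plane):
            out[i] |= bit << (bits - 1 - p)
    return out
```

Decoding all planes round-trips exactly; decoding only the first few planes gives every value with its low-order bits zeroed, which is the "half the bandwidth, lower fidelity" trade-off the abstract describes.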

Dealer: An End-to-End Model Marketplace with Differential Privacy [Download Paper] (Emory University/Georgia Institute of Technology), (Emory University), (Emory University), (Emory University), (Simon Fraser University), (CS) Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace perspective and propose the first end-to-end model marketplace with differential privacy (Dealer) towards answering the following questions: How to formulate data owners' compensation functions and model buyers' price functions? How can the broker determine prices for a set of models to maximize the revenue with arbitrage-free guarantee, and train a set of models with maximum data coverage given a manufacturing budget to remain competitive? For the former, we propose a compensation function for each data owner based on Shapley value and privacy sensitivity, and a price function for each model buyer based on data coverage sensitivity and noise sensitivity. Both privacy sensitivity and noise sensitivity are measured by the level of differential privacy. For the latter, we formulate two optimization problems for model pricing and model training, and propose efficient dynamic programming algorithms. Experiment results on the real breast cancer dataset and synthetic datasets justify the design of Dealer and verify the efficiency and effectiveness of the proposed algorithms.

09:00 – 10:30 CEST Research Session 39: Indices Chaired by Boris Novikov

Benchmarking Learned Indexes [Download Paper] (MIT), (MIT), (TUM), (TUM), (Intel), (TUM), (TUM), (MIT) Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.
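The core mechanism being benchmarked, a model that predicts a key's position plus a bounded local search, can be sketched in a few lines (a toy single-segment index; real learned indexes such as RMI, PGM, or ALEX use hierarchies of models and careful layouts):

```python
import bisect

class LearnedIndexSketch:
    """Toy learned index over a sorted, dense array: fit one linear model
    mapping key -> position, record the maximum prediction error, and
    answer lookups with a binary search confined to the error window."""

    def __init__(self, keys):
        self.keys = keys  # sorted list of distinct keys
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        # Least-squares fit of position ~ a*key + b.
        self.a = (sum((k - mean_k) * (i - mean_p)
                      for i, k in enumerate(keys)) / var) if var else 0.0
        self.b = mean_p - self.a * mean_k
        self.err = max(abs(self.a * k + self.b - i)
                       for i, k in enumerate(keys))

    def lookup(self, key):
        guess = self.a * key + self.b
        lo = max(0, int(guess - self.err))
        hi = min(len(self.keys), int(guess + self.err) + 2)
        i = bisect.bisect_left(self.keys, key, lo, hi)  # "last-mile" search
        return i if i < len(self.keys) and self.keys[i] == key else None
```

The recorded maximum error guarantees the true position always falls inside the search window, which is why lookups stay correct while the search range shrinks from the whole array to a few positions.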

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads [Download Paper] (MIT), (MIT), (MIT CSAIL), (MIT) Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.

Cuckoo Index: A Lightweight Secondary Index Structure [Download Paper] (MIT), (Google), (RelationalAI), (Centrum Wiskunde & Informatica), (Carnegie Mellon University) In modern data warehousing, data skipping is essential for high query performance. While index structures such as B-trees or hash tables allow for precise pruning, their large storage requirements make them impractical for indexing secondary columns. Therefore, many systems rely on approximate indexes such as min/max sketches (ZoneMaps) or Bloom filters for cost-effective data pruning. For example, Google PowerDrill skips more than 90% of data on average using such indexes. In this paper, we introduce Cuckoo Index (CI), an approximate secondary index structure that represents the many-to-many relationship between keys and data partitions in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions. With our approach, we target equality predicates in a read-only (immutable) setting and optimize for space efficiency under the premise of practical build and lookup performance. In contrast to per-partition (Bloom) filters, CI produces correct results for lookups with keys that occur in the data. CI allows controlling the ratio of false positive partitions for lookups with non-occurring keys. Our experiments with real-world and synthetic data show that CI consumes significantly less space than per-partition filters for the same pruning power for low-to-medium cardinality columns. For high cardinality columns, CI is on par with its baselines.
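The fingerprint-to-partition-bitmap idea can be sketched as follows (hypothetical code; CI proper stores variable-sized fingerprints in a cuckoo filter alongside compressed bitmaps, rather than the plain dict used here):

```python
import hashlib

FP_BITS = 8  # fingerprint width; shorter fingerprints save space but
             # raise the false-positive rate for non-occurring keys

def _fingerprint(key):
    # Deterministic short fingerprint of the key (first SHA-1 byte).
    return hashlib.sha1(key.encode()).digest()[0] & ((1 << FP_BITS) - 1)

def build_index(partitions):
    """Map each key's fingerprint to a bitmap of partitions containing it.
    Keys that occur in the data always find all their partitions; a
    non-occurring key may collide with a stored fingerprint and yield
    false-positive partitions, the trade-off CI lets users control."""
    index = {}
    for pid, keys in enumerate(partitions):
        for key in keys:
            fp = _fingerprint(key)
            index[fp] = index.get(fp, 0) | (1 << pid)
    return index

def qualifying_partitions(index, key):
    """Bitmap of partitions that may contain `key` (superset, never a miss)."""
    return index.get(_fingerprint(key), 0)
```

Unlike one Bloom filter per partition, a single structure answers "which partitions?" directly, which is what makes the many-to-many key-partition mapping space-efficient.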

Updatable Learned Index with Precise Positions [Download Paper] (Tsinghua University), (Tsinghua University, China), (Chinese Academy of Sciences), (Tsinghua University), (UCLA), (Tsinghua University) Indexes play an essential role in modern database engines to accelerate query processing. The new paradigm of "learned index" has significantly changed the way of designing index structures in DBMS. The key insight is that indexes could be regarded as learned models that predict the position of a lookup key in the dataset. While such studies show promising results in both lookup time and index size, they cannot efficiently support update operations. Although recent studies have proposed some preliminary approaches to support updates, they come at the cost of sacrificing lookup performance, as they suffer from overheads brought by imprecise predictions in the leaf nodes. In this paper, we propose LIPP, a brand new framework of learned index to address such issues. Similar to state-of-the-art learned index structures, LIPP is able to support all kinds of index operations, namely lookup, insert, delete, update and bulkload. Meanwhile, we overcome the limitations of previous studies by properly extending the tree structure when dealing with update operations so as to eliminate the deviation of locations predicted by models in the leaf nodes. Moreover, we further propose a dynamic adjustment strategy to ensure that the height of the tree index is tightly bounded and provide comprehensive theoretical analysis to illustrate it. We conduct an extensive set of experiments on several real-life datasets. The results demonstrate that our methods consistently outperform state-of-the-art solutions, achieving improvements of up to 4x across a broad class of workloads with different index operations.

Scalable Structural Index Construction for JSON Analytics [Download Paper] (University of California, Riverside), (Michigan Technological University), (University of California, Riverside) JavaScript Object Notation (JSON) and its variants have gained great popularity in recent years. Unfortunately, the performance of their analytics is often dragged down by the expensive JSON parsing. To address this, recent work has shown that building bitwise indices on JSON data, called structural indices, can greatly accelerate querying. Despite its promise, the existing structural index construction does not scale well as records become larger and more complex, due to its (inherently) sequential construction process and the involvement of costly memory copies that grow as the nesting level increases. To address the above issues, this work introduces Pison – a more memory-efficient structural index constructor with support for intra-record parallelism. First, Pison features a redesign of the bottleneck step in the existing solution. The new design is not only simpler but more memory-efficient. More importantly, Pison is able to build structural indices for a single bulky record in parallel, enabled by a group of customized parallelization techniques. Finally, Pison is also optimized for better data locality, which is especially critical in the scenario of bulky record processing. Our evaluation using real-world JSON datasets shows that Pison achieves 9.8X speedup (on average) over the existing structural index construction solution for bulky records and 4.6X speedup (on average) of end-to-end performance (indexing plus querying) over a state-of-the-art SIMD-based JSON parser on a 16-core machine.
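The structural index itself is a compact concept (a sequential sketch is shown below; Pison's contribution is constructing such bitmaps for a single bulky record in parallel and memory-efficiently):

```python
def structural_bitmaps(json_bytes):
    """Build bitwise structural indexes over a JSON record: one bitmap per
    structural character, with bit i set iff byte i is that character and
    lies outside any string literal. A query engine can then jump between
    colons/commas/braces without re-parsing the raw text.
    (Sequential sketch of the structural-index idea only.)"""
    bitmaps = {c: 0 for c in b'{}[]:,'}
    in_string = escaped = False
    for i, ch in enumerate(json_bytes):
        if escaped:                       # byte after a backslash in a string
            escaped = False
        elif ch == ord('\\'):
            escaped = in_string
        elif ch == ord('"'):
            in_string = not in_string     # enter/leave a string literal
        elif not in_string and ch in bitmaps:
            bitmaps[ch] |= 1 << i
    return bitmaps
```

Note that a `:` inside a string literal (e.g., in a key like `"a:b"`) is correctly excluded; distinguishing structural characters from string content is exactly the step that makes naive parallelization hard.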

09:00 – 10:30 CEST Research Session 40: Data Integration Chaired by Fatemeh Nargesian

Deep Entity Matching with Pre-Trained Language Models [Download Paper] (Megagon Labs), (Megagon Labs), (Megagon Labs), (University of Wisconsin-Madison), (Facebook AI) We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models. To leverage such models, we cast EM as a sequence-pair classification problem and fine-tune the models on it. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms the previous state-of-the-art (SOTA), by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto’s matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, Ditto is forced to learn “harder” to improve the model’s matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate Ditto’s effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
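Casting EM as sequence-pair classification starts by serializing each record into a token sequence; one plausible serialization (the `[COL]`/`[VAL]` special tokens here are illustrative, not necessarily Ditto's exact scheme) looks like:

```python
def serialize_entry(entry):
    """Flatten a record (attribute -> value dict) into a token sequence
    with special markers for attribute names and values. A pair of such
    sequences is then fed to a pre-trained language model as a
    sequence-pair classification input (match / non-match).
    [COL]/[VAL] are assumed marker tokens for this sketch."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in entry.items())

def serialize_pair(left, right, sep="[SEP]"):
    """Build the sequence-pair input for a Transformer-style classifier."""
    return f"{serialize_entry(left)} {sep} {serialize_entry(right)}"
```

The classifier head on top of the language model then only has to decide match/non-match for the pair; the abstract's optimizations (span highlighting, summarization, augmentation) operate on these serialized sequences.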

Deep Learning for Blocking in Entity Matching: A Design Space Exploration [Download Paper] (QCRI), (Amazon Alexa AI), (Qatar Computing Research Institute, HBKU), (Qatar Computing Research Institute, HBKU), (UW - Madison), (University of Wisconsin-Madison), (American Family Insurance), (University of Wisconsin-Madison) Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking then matching. Many works have applied deep learning (DL) to matching, but far fewer works have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL and some of them require labeled training data. In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL such as sequence modeling, transformers, and self-supervision. We empirically determine which solutions perform best on what kind of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best non-DL solutions (including a state-of-the-art industrial non-DL solution) on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new avenue for research.

Dual-Objective Fine-Tuning of BERT for Entity Matching [Download Paper] (University of Mannheim), (University of Mannheim) An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. 
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.

Discovering Related Data At Scale [Download Paper] (Microsoft), (Microsoft Research), (Microsoft Research), (Microsoft Research) Analysts frequently require data from multiple sources for their tasks, but finding these sources is challenging in exabyte-scale data lakes. In this paper, we address this problem for our enterprise's data lake by using machine learning to identify related data sources. Leveraging queries made to the data lake over a month, we build a relevance model that determines whether two columns across two data streams are related or not. We then use the model to find relations at scale across tens of millions of column-pairs and thereafter construct a data relationship graph in a scalable fashion, processing a data lake that has 4.5 Petabytes of data in approximately 80 minutes. Using manually labeled datasets as ground truth, we show that our techniques improve on state-of-the-art methods by at least 23%.

09:00 – 10:30 CEST Research Session 41: Graph Neural Networks Chaired by Michael Loster

Accelerating Large Scale Real-Time GNN Inference using Channel Pruning [Download Paper] (University of Southern California), (University of Southern California), (USC), (University of Southern California), (University of Southern California) Graph Neural Networks (GNNs) are proven to be powerful models to generate node embedding for downstream applications. However, due to the high computation complexity of GNN inference, it is hard to deploy GNNs for large-scale or real-time applications. In this paper, we propose to accelerate GNN inference by pruning the dimensions in each layer with negligible accuracy loss. Our pruning framework uses a novel LASSO regression formulation for GNNs to identify feature dimensions (channels) that have high influence on the output activation. We identify two inference scenarios and design pruning schemes based on their computation and memory usage for each. To further reduce the inference complexity, we effectively store and reuse hidden features of visited nodes, which significantly reduces the number of supporting nodes needed to compute the target embedding. We evaluate the proposed method with the node classification problem on five popular datasets and a real-time spam detection application. We demonstrate that the pruned GNN models greatly reduce computation and memory usage with little accuracy loss. For full inference, the proposed method achieves an average of 3.27x speedup with only 0.002 drop in F1-Micro on GPU. For batched inference, the proposed method achieves an average of 6.67x speedup with only 0.003 drop in F1-Micro on CPU. To the best of our knowledge, we are the first to accelerate large scale real-time GNN inference through channel pruning.
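A toy sketch of the LASSO idea behind channel selection, under the assumption that channel influence can be modeled as a sparse linear fit from hidden features to an output activation. This is generic ISTA-style LASSO, not the paper's GNN-specific formulation; all data below is synthetic.

```python
import numpy as np

def lasso_channel_scores(H, y, lam=0.1, lr=0.05, steps=1000):
    """Fit sparse weights w minimizing ||Hw - y||^2/(2n) + lam*||w||_1
    via proximal gradient descent (ISTA). Channels whose weight stays
    at zero are candidates for pruning."""
    n, c = H.shape
    w = np.zeros(c)
    for _ in range(steps):
        grad = H.T @ (H @ w - y) / n
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(1)
H = rng.normal(size=(200, 10))            # hidden features: 200 nodes x 10 channels
true_w = np.zeros(10)
true_w[[0, 3]] = [2.0, -1.5]              # only channels 0 and 3 influence the output
y = H @ true_w + 0.01 * rng.normal(size=200)
w = lasso_channel_scores(H, y)
kept = np.flatnonzero(np.abs(w) > 0.1)    # channels to keep; the rest are pruned
```

The L1 penalty drives the weights of uninfluential channels exactly to zero, which is what makes LASSO a natural channel-selection tool.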

Efficient Streaming Subgraph Isomorphism with Graph Neural Networks [Download Paper] (Ecole Polytechnique Federale de Lausanne), (Hanoi University of Science and Technology), (The University of Queensland), (Humboldt-Universität zu Berlin), (Griffith University), (EPFL) Queries to detect isomorphic subgraphs are important in graph-based data management. While the problem of subgraph isomorphism search has received considerable attention for the static setting of a single query, or a batch thereof, existing approaches do not scale to a dynamic setting of a continuous stream of queries. In this paper, we address the scalability challenges induced by a stream of subgraph isomorphism queries by caching and re-use of previous results. We first present a novel subgraph index based on graph embeddings that serves as the foundation for efficient stream processing. It enables not only effective caching and re-use of results, but also speeds up traditional algorithms for subgraph isomorphism in case of cache misses. Moreover, we propose cache management policies that incorporate notions of reusability of query results. Experiments using real-world datasets demonstrate the effectiveness of our approach in handling isomorphic subgraph search for streams of queries.

Grain: Improving Data Efficiency of Graph Neural Networks via Diversified Influence Maximization [Download Paper] (Peking University), (Peking University), (Peking University), (Peking University), (Peking University), (Alibaba group), (Peking University) Data selection methods, such as active learning and core-set selection, are useful tools for improving the data efficiency of deep learning models on large-scale datasets. However, recent deep learning models have moved forward from independent and identically distributed data to graph-structured data, such as social networks, e-commerce user-item graphs, and knowledge graphs. This evolution has led to the emergence of Graph Neural Networks (GNNs) that go beyond the models existing data selection methods are designed for. Therefore, we present GRAIN, an efficient framework that opens up a new perspective through connecting data selection in GNNs with social influence maximization. By exploiting the common patterns of GNNs, GRAIN introduces a novel feature propagation concept, a diversified influence maximization objective with novel influence and diversity functions, and a greedy algorithm with an approximation guarantee into a unified framework. Empirical studies on public datasets demonstrate that GRAIN significantly improves both the performance and efficiency of data selection (including active learning and core-set selection) for GNNs. To the best of our knowledge, this is the first attempt to bridge two largely parallel threads of research, data selection, and social influence maximization, in the setting of GNNs, paving new ways for improving data efficiency.
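The greedy-with-approximation-guarantee ingredient can be illustrated on a toy coverage objective: for monotone submodular functions, the greedy rule below attains the classic (1 - 1/e) approximation that algorithms in this line of work build on. The influence sets are invented for illustration.

```python
def greedy_select(influence_sets, budget):
    """Greedily pick `budget` nodes maximizing the number of covered
    (influenced) nodes. Because coverage is monotone submodular, this
    greedy enjoys the classic (1 - 1/e) approximation guarantee."""
    selected, covered = [], set()
    for _ in range(budget):
        best = max((n for n in influence_sets if n not in selected),
                   key=lambda n: len(influence_sets[n] - covered))
        selected.append(best)
        covered |= influence_sets[best]
    return selected, covered

# hypothetical influence sets: node -> set of nodes it influences
inf = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {6, 7},
    "d": {1, 2},
}
picked, covered = greedy_select(inf, budget=2)
```

Note how the second pick is "c" rather than "b": its *marginal* gain over what "a" already covers is larger, which is the diversification effect the objective rewards.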

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture [Download Paper] (University of Illinois at Urbana-Champaign), (University of Illinois at Urbana-Champaign), (University of Illinois at Urbana-Champaign), (University of Illinois at Urbana-Champaign), (IBM Thomas J. Watson Research Center), (NVIDIA), (University of Illinois at Urbana-Champaign), (NVIDIA Corporation) Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU. This is because the CPU needs to (1) read sparse features from memory, (2) write features into memory as a dense format, and (3) transfer the features from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help. By removing the CPU gathering stage, our method significantly reduces the consumption of the host resources and data access latency. We further present two important techniques to achieve high host memory access efficiency by the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness using several graphs with sizes up to 111 million nodes and 1.6 billion edges. 
In a multi-GPU training setup, our method is 65-92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.

Multi-Modal Transportation Recommendation with Unified Route Representation Learning [Download Paper] (Business Intelligence Lab, Baidu Research), (Baidu), (University of Central Florida), (Baidu Inc.), (Baidu Inc.), (Rutgers University) Multi-modal transportation recommendation aims to provide the most appropriate travel route with various transportation modes according to certain criteria. After analyzing large-scale navigation data, we find that route representations exhibit two patterns: spatio-temporal autocorrelations within transportation networks and the semantic coherence of route sequences. However, there are few studies that consider both patterns when developing multi-modal transportation systems. To this end, in this paper, we study multi-modal transportation recommendation with unified route representation learning by exploiting both spatio-temporal dependencies in transportation networks and the semantic coherence of historical routes. Specifically, we propose to unify both dynamic graph representation learning and hierarchical multi-task learning for multi-modal transportation recommendations. Along this line, we first transform the multi-modal transportation network into time-dependent multi-view transportation graphs and propose a spatiotemporal graph neural network module to capture the spatial and temporal autocorrelation. Then, we introduce a coherent-aware attentive route representation learning module to project arbitrary-length routes into fixed-length representation vectors, with explicit modeling of route coherence from historical routes. Moreover, we develop a hierarchical multi-task learning module to differentiate route representations for different transport modes, and this is guided by the final recommendation feedback as well as multiple auxiliary tasks equipped in different network layers. 
Extensive experimental results on two large-scale real-world datasets demonstrate that the proposed system outperforms eight baselines.

11:00 – 12:30 CEST Research Session 42: Storage and in-memory DBMS Chaired by Andreas Kipf

Taurus: Lightweight Parallel Logging for In-Memory Database Management Systems [Download Paper] (MIT), (University of Wisconsin-Madison), (Carnegie Mellon University), (MIT) Existing single-stream logging schemes are unsuitable for in-memory database management systems (DBMSs) as the single log is often a performance bottleneck. To overcome this problem, we present Taurus, an efficient parallel logging scheme that uses multiple log streams, and is compatible with both data and command logging. Taurus tracks and encodes transaction dependencies using a vector of log sequence numbers (LSNs). These vectors ensure that the dependencies are fully captured in logging and correctly enforced in recovery. Our experimental evaluation with an in-memory DBMS shows that Taurus’s parallel logging achieves up to 9.9× and 2.9× speedups over single-streamed data logging and command logging, respectively. It also enables the DBMS to recover up to 22.9× and 75.6× faster than these baselines for data and command logging, respectively. We also compare Taurus with two state-of-the-art parallel logging schemes and show that the DBMS achieves up to 2.8× better performance on NVMe drives and 9.2× on HDDs.
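A toy sketch of the LSN-vector idea: each log entry carries a per-stream dependency vector, and recovery replays an entry only once every stream has been replayed up to the recorded LSNs. The entry layout and replay loop are illustrative assumptions, not Taurus's actual format.

```python
# Toy dependency tracking with per-stream LSN vectors. Each log entry
# records, for every stream, the highest LSN it depends on; recovery
# replays an entry only when all streams have caught up to that vector.

NUM_STREAMS = 3

def can_replay(dep_vector, replayed_lsns):
    """True once every stream has been replayed up to the entry's
    recorded dependency LSN."""
    return all(replayed_lsns[s] >= dep_vector[s] for s in range(NUM_STREAMS))

# entries: (stream, lsn, dependency vector over the 3 streams)
log = [
    (0, 1, [0, 0, 0]),
    (1, 1, [1, 0, 0]),   # depends on stream 0 having reached LSN 1
    (2, 1, [1, 1, 0]),   # depends on streams 0 and 1
]

replayed = [0, 0, 0]     # highest LSN replayed so far, per stream
order = []
pending = list(log)
while pending:
    for entry in pending:
        stream, lsn, dep = entry
        if can_replay(dep, replayed):
            replayed[stream] = lsn
            order.append((stream, lsn))
            pending.remove(entry)
            break
    else:
        raise RuntimeError("unsatisfiable dependency")
```

The vectors impose a partial order: entries on different streams with no mutual dependencies could equally be replayed in parallel.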

CoroBase: Coroutine-Oriented Main-Memory Database Engine [Download Paper] (Simon Fraser University), (Simon Fraser University), (Simon Fraser University) Data stalls are a major overhead in main-memory database engines due to the use of pointer-rich data structures. Lightweight coroutines ease the implementation of software prefetching to hide data stalls by overlapping computation and asynchronous data prefetching. Prior solutions, however, mainly focused on (1) individual components and operations and (2) intra-transaction batching that requires interface changes, breaking backward compatibility. It was not clear how they apply to a full database engine and how much end-to-end benefit they bring under various workloads. This paper presents CoroBase, a main-memory database engine that tackles these challenges with a new coroutine-to-transaction paradigm. Coroutine-to-transaction models transactions as coroutines and thus enables inter-transaction batching, avoiding application changes but retaining the benefits of prefetching. We show that on a 48-core server, CoroBase can perform close to 2x better for read-intensive workloads, and remain competitive for workloads that inherently do not benefit from software prefetching.
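The coroutine-to-transaction paradigm can be sketched with Python generators: each transaction suspends at the point where it would issue a prefetch, and a round-robin scheduler interleaves whole transactions (inter-transaction batching) without changing the transaction interface. This is a conceptual sketch, not CoroBase's C++ implementation.

```python
# Each transaction is a coroutine that yields where it would prefetch;
# the scheduler overlaps one transaction's "memory stall" with the
# computation of others.

def transaction(name, keys, store, results):
    total = 0
    for k in keys:
        yield ("prefetch", k)        # suspend while the data is fetched
        total += store[k]            # resume once the data is (assumed) cached
    results[name] = total

def scheduler(txns):
    """Round-robin over in-flight transaction coroutines."""
    active = list(txns)
    while active:
        for t in list(active):
            try:
                next(t)
            except StopIteration:
                active.remove(t)

store = {"x": 1, "y": 2, "z": 3}
results = {}
scheduler([
    transaction("t1", ["x", "y"], store, results),
    transaction("t2", ["z"], store, results),
])
```

Because the batching happens *between* transactions rather than inside one, the application-visible transaction API stays unchanged, which is the paper's key point.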

Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs [Download Paper] (IBM Research - Zurich), (IBM Research - Zurich), (IBM Research - Zurich), (Isovalent) Solid-state drives (SSDs) are extensively used to deploy persistent data stores, as they provide low latency random access, high write throughput, high data density, and low cost. Tree-based data structures are widely used to build persistent data stores, and indeed they lie at the backbone of many of the data management systems used in production and research today. We show that benchmarking a persistent tree-based data structure on an SSD is a complex process, which may easily incur subtle pitfalls that can lead to an inaccurate performance assessment. At a high level, these pitfalls stem from the interaction of complex software running on complex hardware. On the one hand, tree structures implement internal operations that have non-trivial effects on performance. On the other hand, SSDs employ firmware logic to deal with the idiosyncrasies of the underlying flash memory, which are well known to also lead to complex performance dynamics. We identify seven benchmarking pitfalls using RocksDB and WiredTiger, two widespread implementations of an LSM-Tree and a B+Tree, respectively. We show that such pitfalls can lead to incorrect measurements of key performance indicators, hinder the reproducibility and the representativeness of the results, and lead to suboptimal deployments in production environments. We also provide guidelines on how to avoid these pitfalls to obtain more reliable performance measurements, and to perform more thorough and fair comparisons among different design points.

Constructing and Analyzing the LSM Compaction Design Space [Download Paper] (Boston University), (Boston University), (Boston University), (Boston University) Log-structured merge (LSM) trees offer efficient ingestion by appending incoming data, and thus, are widely used as the storage layer of production NoSQL data stores. To enable competitive read performance, LSM-trees periodically re-organize data to form a tree with levels of exponentially increasing capacity, through iterative compactions. Compactions fundamentally influence the performance of an LSM-engine in terms of write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. Hence, choosing the appropriate compaction strategy is crucial and, at the same time, hard as the LSM-compaction design space is vast, largely unexplored, and has not been formally defined in the literature. As a result, most LSM-based engines use a fixed compaction strategy, typically hand-picked by an engineer, which decides how and when to compact data. In this paper, we present the design space of LSM-compactions, and evaluate state-of-the-art compaction strategies with respect to key performance metrics. Toward this goal, our first contribution is to introduce a set of four design primitives that can formally define any compaction strategy: (i) the compaction trigger, (ii) the data layout, (iii) the compaction granularity, and (iv) the data movement policy. Together, these primitives can synthesize both existing and completely new compaction strategies. Our second contribution is to experimentally analyze 10 compaction strategies. We present 12 observations and 7 high-level takeaway messages, which show how LSM systems can navigate the compaction design space.
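The four primitives can be captured as a small record type; the concrete field values below are illustrative labels for two well-known layouts, not the paper's full enumeration of the design space.

```python
from dataclasses import dataclass

# The four fields mirror the paper's taxonomy of compaction primitives;
# the string values are invented examples, not an exhaustive vocabulary.

@dataclass(frozen=True)
class CompactionStrategy:
    trigger: str          # when to compact, e.g. "level saturation"
    data_layout: str      # how data is organized, e.g. "leveling", "tiering"
    granularity: str      # how much data moves at once, e.g. "single file"
    movement_policy: str  # which data to move, e.g. "least overlapping file"

leveling = CompactionStrategy("level saturation", "leveling",
                              "single file", "least overlapping file")
tiering = CompactionStrategy("level saturation", "tiering",
                             "whole level", "entire run")
```

Synthesizing a new strategy is then just picking a different combination of the four field values, which is exactly what makes the primitives a design space rather than a fixed menu.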

11:00 – 12:30 CEST Research Session 43: Representation Learning Chaired by Michael Loster

Scaling Attributed Network Embedding to Massive Graphs [Download Paper] (National University of Singapore), (The Hong Kong Polytechnic University), (National University of Singapore), (Hamad bin Khalifa University), (National University of Singapore), (Nanyang Technological University) Given a graph 𝐺 where each node is associated with a set of attributes, attributed network embedding (ANE) maps each node 𝑣∈𝐺 to a compact vector 𝑋𝑣, which can be used in downstream machine learning tasks. Ideally, 𝑋𝑣 should capture node 𝑣’s affinity to each attribute, which considers not only 𝑣’s own attribute associations, but also those of its connected nodes along edges in 𝐺. It is challenging to obtain high-utility embeddings that enable accurate predictions; scaling effective ANE computation to massive graphs with millions of nodes pushes the difficulty of the problem to a whole new level. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. This paper proposes PANE, an effective and scalable approach to ANE computation for massive graphs that achieves state-of-the-art result quality on multiple benchmark datasets, measured by the accuracy of three common prediction tasks: attribute inference, link prediction, and node classification. In particular, for the large MAG data with over 59 million nodes, 0.98 billion edges, and 2000 attributes, PANE is the only known viable solution that obtains effective embeddings on a single server, within 12 hours. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel random walk model for attributed networks. The resulting optimization task is still challenging on large graphs. 
Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. Extensive experiments, comparing 10 existing approaches on 8 real datasets, demonstrate that PANE consistently outperforms all existing methods in terms of result quality, while being orders of magnitude faster.

FREDE: Anytime Graph Embeddings [Download Paper] (University of Bonn), (Skoltech), (Aarhus University), (Aarhus University), (Skolkovo Institute of Science and Technology), (University of Bonn & Fraunhofer IAIS) Low-dimensional representations, or embeddings, of a graph’s nodes facilitate several practical data science and data engineering tasks. As such embeddings rely, explicitly or implicitly, on a similarity measure among nodes, they require the computation of a quadratic similarity matrix, inducing a tradeoff between space complexity and embedding quality. To date, no graph embedding work combines (i) linear space complexity, (ii) a nonlinear transform as its basis, and (iii) nontrivial quality guarantees. In this paper we introduce FREDE (FREquent Directions Embedding), a graph embedding based on matrix sketching that combines those three desiderata. Starting out from the observation that embedding methods aim to preserve the covariance among the rows of a similarity matrix, FREDE iteratively improves on quality while individually processing rows of a nonlinearly transformed PPR similarity matrix derived from a state-of-the-art graph embedding method and provides, at any iteration, column-covariance approximation guarantees in due course almost indistinguishable from those of the optimal approximation by SVD. Our experimental evaluation on variably sized networks shows that FREDE performs almost as well as SVD and competitively against state-of-the-art embedding methods in diverse data science tasks, even when it is based on as little as 10% of node similarities.
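FREDE builds on the Frequent Directions matrix-sketching primitive; below is a compact numpy version of that textbook algorithm (not FREDE itself), together with its deterministic error guarantee ||A^T A - B^T B||_2 <= ||A||_F^2 / ell.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Stream the rows of a tall matrix A and maintain an ell x d sketch B
    satisfying ||A^T A - B^T B||_2 <= ||A||_F^2 / ell."""
    B = np.zeros((ell, rows.shape[1]))
    for row in rows:
        zero = np.flatnonzero(~B.any(axis=1))   # free slots in the sketch
        if len(zero) == 0:                      # sketch full: shrink
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[-1] ** 2                  # smallest squared singular value
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = np.diag(s) @ Vt                 # last row becomes zero
            zero = np.flatnonzero(~B.any(axis=1))
        B[zero[0]] = row
    return B

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 6))
B = frequent_directions(A, ell=4)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
bound = np.linalg.norm(A, "fro") ** 2 / 4       # the FD guarantee for ell = 4
```

The row-at-a-time processing and anytime-valid covariance guarantee are the properties the paper exploits; FREDE additionally applies this machinery to a nonlinearly transformed similarity matrix.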

Tensors: An abstraction for general data processing [Download Paper] (ETHZ), (University of California, San Diego), (Microsoft), (Microsoft), (ETHZ), (Microsoft) Deep Learning (DL) has created a growing demand for simpler ways to develop complex models and efficient ways to execute them. Thus, a significant effort has gone into frameworks like PyTorch or TensorFlow to support a variety of DL models and run efficiently and seamlessly over heterogeneous and distributed hardware. Since these frameworks will continue improving given the predominance of DL workloads, it is natural to ask what else can be done with them. This is not a trivial question since these frameworks are based on the efficient implementation of tensors, which are well adapted to DL but, in principle, to nothing else. In this paper we explore to what extent Tensor Computation Runtimes (TCRs) can support non-ML data processing applications, so that other use cases can take advantage of the investments made on TCRs. In particular, we are interested in graph processing and relational operators, two use cases very different from ML, in high demand, and complement quite well what TCRs can do today. Building on Hummingbird, a recent platform converting traditional machine learning algorithms to tensor computations, we explore how to map selected graph processing and relational operator algorithms into tensor computations. Our vision is supported by the results: our code often outperforms custom-built C++ and CUDA kernels, while massively reducing the development effort, taking advantage of the cross-platform compilation capabilities of TCRs.
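As a flavor of mapping relational operators onto tensor primitives, an equi-join can be written as a broadcasted comparison followed by an index extraction; a TCR backend would execute the same pattern on GPU. This sketch is illustrative and is not the paper's implementation.

```python
import numpy as np

def equi_join(left_keys, right_keys):
    """Equi-join expressed with tensor primitives: a broadcasted outer
    comparison builds a boolean match matrix, and argwhere extracts the
    matching (left, right) index pairs."""
    match = left_keys[:, None] == right_keys[None, :]   # |L| x |R| comparison
    return np.argwhere(match)

L = np.array([1, 2, 3, 2])
R = np.array([2, 3, 3])
pairs = equi_join(L, R)
```

The quadratic match matrix is wasteful compared to a hash join on CPU, but it is exactly the kind of dense, data-parallel computation that tensor runtimes execute extremely well.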

11:00 – 12:30 CEST Research Session 44: Distributed Systems Chaired by Alberto Lerner

Seagull: An Infrastructure for Load Prediction and Optimized Resource Allocation [Download Paper] (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft -- GSL), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft) Microsoft Azure is dedicated to guarantee high quality of service to its customers, in particular, during periods of high customer activity, while controlling cost. We employ a Data Science (DS) driven solution to predict user load and leverage these predictions to optimize resource allocation. To this end, we built the Seagull infrastructure that processes per-server telemetry, validates the data, trains and deploys ML models. The models are used to predict customer load per server (24h into the future), and optimize service operations. Seagull continually re-evaluates prediction accuracy, falls back to previously known-good models, and triggers alerts as appropriate. We deployed this infrastructure in production for PostgreSQL and MySQL servers across all Azure regions, and applied it to the problem of scheduling server backups during low-load time. This minimizes interference with user-induced load and improves customer experience.

ByShard: Sharding in a Byzantine Environment [Download Paper] (University of California Davis), (University of California, Davis) The emergence of blockchains has fueled the development of resilient systems that can deal with Byzantine failures due to crashes, bugs, or even malicious behavior. Recently, we have also seen the exploration of sharding in these resilient systems to provide the scalability required by very large data-based applications. Unfortunately, current sharded resilient systems all use system-specific specialized approaches toward sharding that do not provide the flexibility of traditional sharded data management systems. To improve on this situation, we fundamentally look at the design of sharded resilient systems. We do so by introducing ByShard, a unifying framework for the study of sharded resilient systems. Within this framework, we show how two-phase commit and two-phase locking---two techniques central to providing atomicity and isolation in traditional sharded databases---can be implemented efficiently in a Byzantine environment with minimal usage of costly Byzantine-resilient primitives. Based on these techniques, we propose eighteen multi-shard transaction processing methods. Finally, we practically evaluate these methods and show that each method supports high transaction throughput and provides scalability, while striking its own trade-off between isolation level, latency, and abort rate. As such, our work provides a strong foundation for the development of ACID-compliant general-purpose and flexible sharded resilient data management systems.
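The two-phase commit skeleton at the heart of such multi-shard methods looks as follows; in ByShard each per-shard step would itself run under Byzantine consensus, which this toy sketch abstracts away as a single trusted vote per shard.

```python
# Toy two-phase commit across shards. Phase 1 collects votes from every
# involved shard; phase 2 commits only if all voted yes, else aborts.

def two_phase_commit(shards, txn):
    votes = {name: shard.prepare(txn) for name, shard in shards.items()}
    decision = "commit" if all(votes.values()) else "abort"
    for shard in shards.values():
        shard.finish(txn, decision)
    return decision

class Shard:
    def __init__(self, will_vote_yes):
        self.will_vote_yes = will_vote_yes
        self.log = []
    def prepare(self, txn):
        self.log.append(("prepared", txn))
        return self.will_vote_yes
    def finish(self, txn, decision):
        self.log.append((decision, txn))

shards = {"s1": Shard(True), "s2": Shard(True)}
d1 = two_phase_commit(shards, "t1")      # all vote yes -> commit
shards["s2"].will_vote_yes = False
d2 = two_phase_commit(shards, "t2")      # one no vote -> abort everywhere
```

Making each `prepare`/`finish` step tolerate Byzantine shard members, with as few expensive consensus invocations as possible, is precisely the design space the paper's eighteen methods explore.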

MorphoSys: Automatic Physical Design Metamorphosis for Distributed Database Systems [Download Paper] (University of Waterloo), (University of Waterloo), (University of Waterloo) Distributed database systems are widely used to meet the demands of storing and managing computation-heavy workloads. To boost performance and minimize resource and data contention, these systems require selecting a distributed physical design that determines where to place data, and which data items to replicate and partition. Deciding on a physical design is difficult as each choice poses a trade-off in the design space, and a poor choice can significantly degrade performance. Current design decisions are typically static and cannot adapt to workload changes or are unable to combine multiple design choices such as data replication and data partitioning integrally. This paper presents MorphoSys, a distributed database system that dynamically chooses, and alters, its physical design based on the workload. MorphoSys makes integrated design decisions for all of the data partitioning, replication and placement decisions on-the-fly using a learned cost model. MorphoSys provides efficient transaction execution in the face of design changes via a novel concurrency control and update propagation scheme. Our experimental evaluation, using several benchmark workloads and state-of-the-art comparison systems, shows that MorphoSys delivers excellent system performance through effective and efficient physical designs.

Towards Cost-Optimal Query Processing in the Cloud [Download Paper] (Friedrich-Alexander-Universität Erlangen-Nürnberg), (Uni Augsburg, LMU, TUM) Public cloud providers offer hundreds of heterogeneous hardware instances. For analytical query processing systems, this presents a major challenge: depending on the hardware configuration, performance and cost may differ by orders of magnitude. We propose a simple and intuitive model that takes the workload, hardware, and cost into account to determine the optimal instance configuration. We discuss how such a model-based approach can significantly reduce costs and also guide the evolution of cloud-native database systems to achieve our vision of cost-optimal query processing.
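A deliberately simple model in the spirit of the paper: estimate runtime from workload and instance characteristics, then pick the instance minimizing runtime times hourly price. All instance specs and the runtime formula below are made-up illustrations, not the paper's model.

```python
# Toy cost model: a query's runtime is bounded by either scanning the
# data over the network or by compute; total cost = runtime * price.

def runtime_hours(scan_tb, net_gbps, cpu_score):
    scan_h = scan_tb * 8000 / net_gbps / 3600   # TB -> gigabits, then hours
    compute_h = scan_tb / cpu_score             # hypothetical compute model
    return max(scan_h, compute_h)

def cheapest_instance(instances, scan_tb):
    return min(instances,
               key=lambda i: runtime_hours(scan_tb, i["net_gbps"], i["cpu"])
                             * i["usd_per_h"])

instances = [
    {"name": "small", "net_gbps": 10,  "cpu": 1.0, "usd_per_h": 0.5},
    {"name": "big",   "net_gbps": 100, "cpu": 4.0, "usd_per_h": 4.0},
]
best = cheapest_instance(instances, scan_tb=1.0)
```

Even this toy model exhibits the paper's core observation: the fastest instance is not necessarily the cost-optimal one, so the choice must weigh workload shape against the price sheet.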

Crystal: A Unified Cache Storage System for Analytical Databases [Download Paper] (TUM), (Microsoft Research), (Microsoft Research) Cloud analytical databases employ a disaggregated storage model, where the elastic compute layer accesses data persisted on remote cloud storage in block-oriented columnar formats. Given the high latency and low bandwidth to remote storage and the limited size of fast local storage, caching data at the compute node is important and has resulted in a renewed interest in caching for analytics. Today, each DBMS builds its own caching solution, usually based on file- or block-level LRU. In this paper, we advocate a new architecture of a smart cache storage system called Crystal, that is co-located with compute. Crystal's clients are DBMS-specific "data sources" with push-down predicates. Similar in spirit to a DBMS, Crystal incorporates query processing and optimization components focusing on efficient caching and serving of single-table hyper-rectangles called regions. Results show that Crystal, with a small Spark data source connector, can significantly improve query latencies on unmodified Spark while also saving on bandwidth from remote storage.

11:00 – 12:30 CEST Research Session 45: Encrypted Storage and Blockchain Chaired by Divesh Srivastava

CALYPSO: Private Data Management for Decentralized Ledgers [Download Paper] (IST Austria), (EPFL), (EPFL), (UCL), (Trinity College), (EPFL) Distributed ledgers provide high availability and integrity, making them a key enabler for practical and secure computation of distributed workloads among mutually distrustful parties. Many practical applications also require strong confidentiality, however. This work enhances permissioned and permissionless blockchains with the ability to manage confidential data without forfeiting availability or decentralization. The proposed CALYPSO architecture addresses two orthogonal challenges confronting modern distributed ledgers: (a) enabling the auditable management of secrets and (b) protecting distributed computations against arbitrage attacks when their results depend on the ordering and secrecy of inputs. CALYPSO introduces on-chain secrets, a novel abstraction that enforces atomic deposition of an auditable trace whenever users access confidential data. CALYPSO provides user-controlled consent management that ensures revocation atomicity and accountable anonymity. To enable permissionless deployment, we introduce an incentive scheme and provide users with the option to select their preferred trustees. We evaluated our CALYPSO prototype with a confidential document-sharing application and a decentralized lottery. Our benchmarks show that transaction-processing latency increases linearly in terms of security (number of trustees) and is in the range of 0.2 to 8 seconds for 16 to 128 trustees.

SlimChain: Scaling Blockchain Transactions through Off-Chain Storage and Parallel Processing [Download Paper] (Hong Kong Baptist University), (Hong Kong Baptist University), (Hong Kong Baptist University), (Simon Fraser University) Blockchain technology has emerged as the cornerstone of many decentralized applications operating among otherwise untrusted peers. However, it is well known that existing blockchain systems do not scale well. Transactions are often executed and committed sequentially in order to maintain the same view of the total order. Furthermore, it is necessary to duplicate both transaction data and their executions in every node in the blockchain network for integrity assurance. Such storage and computation requirements put significant burdens on the blockchain system, not only limiting system scalability but also undermining system security and robustness by making the network more centralized. To tackle these problems, in this paper, we propose SlimChain, a novel blockchain system that scales transactions through off-chain storage and parallel processing. Advocating a stateless design, SlimChain maintains only the short commitments of ledger states on-chain while dedicating transaction executions and data storage to off-chain nodes. To realize SlimChain, we propose new schemes for off-chain smart contract execution, on-chain transaction validation, and state commitment. We also propose optimizations to reduce network transmissions and a new sharding technique to improve system scalability further. Extensive experiments are conducted to validate the performance of the proposed SlimChain system. Compared with the existing systems, SlimChain reduces the on-chain storage requirements by 97%~99%, while also improving the peak throughput by 1.4X~15.6X.
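The short on-chain state commitments that the stateless design relies on can be illustrated with a standard Merkle tree. This is a hedged sketch of the general idea only, not SlimChain's actual commitment scheme; the example account entries are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of state entries into one 32-byte commitment."""
    level = [h(leaf) for leaf in leaves]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:              # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

state = [b"acct:alice=10", b"acct:bob=5"]
root = merkle_root(state)               # 32 bytes on-chain vs. full state off-chain
assert len(root) == 32
assert merkle_root(state) == root                                # deterministic
assert merkle_root([b"acct:alice=11", b"acct:bob=5"]) != root    # any change shows
```

Off-chain nodes can then prove individual entries against the root, which is why only the commitment needs to live on-chain.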

Cryptanalysis of An Encrypted Database in SIGMOD '14 [Download Paper] (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University) Encrypted databases are an innovative technology proposed to solve the data confidentiality issue in cloud-based DB systems. An encrypted database allows a data owner to encrypt its database before uploading it to the service provider, and it allows the service provider to execute SQL queries over the encrypted data. Most existing encrypted databases (e.g., CryptDB in SOSP '11) do not support data interoperability: they are unable to process complex queries that require piping the output of one operation to another. To the best of our knowledge, SDB (SIGMOD '14) is the only encrypted database that achieves data interoperability. Unfortunately, we found that SDB is not secure. In this paper, we revisit the security of SDB and propose a ciphertext-only attack named the co-prime attack. It successfully attacks the common operations supported by SDB, including addition, comparison, sum, join and group-by. We evaluate our attack on three real-world benchmarks. For columns that support addition and comparison, we recover 84.9%-99.9% of plaintexts. For columns that support sum, join and group-by, we recover 100% of plaintexts. In addition, we provide potential countermeasures that can prevent the attacks against sum, join, group-by and addition. It remains an open problem to prevent the attack against comparison.
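As a heavily hedged toy illustration (this is not SDB's actual scheme, and `k` is a hypothetical column key): the name "co-prime attack" points at the general hazard that ciphertexts sharing a secret multiplicative factor leak it through a gcd whenever the underlying plaintexts are collectively co-prime.

```python
from math import gcd
from functools import reduce

# Hypothetical toy scheme (NOT SDB): each plaintext m in a column is hidden
# as c = k * m, where k is a secret key shared by all rows of the column.
k = 982451653
plaintexts = [42, 55, 9]                 # collectively co-prime: gcd = 1
ciphertexts = [k * m for m in plaintexts]

# A ciphertext-only attacker computes the gcd of all ciphertexts, which is
# k * gcd(plaintexts); when the plaintexts are co-prime, that is exactly k.
recovered_k = reduce(gcd, ciphertexts)
assert recovered_k == k
assert [c // recovered_k for c in ciphertexts] == plaintexts
```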

Software-Defined Data Protection: Low Overhead Policy Compliance at the Storage Layer is Within Reach! [Download Paper] (IT University Copenhagen), (UT Austin), (UT Austin and VMWare) Most modern data processing pipelines run on top of a distributed storage layer, and securing the whole system, and the storage layer in particular, against accidental or malicious misuse is crucial to ensuring compliance with rules and regulations. Enforcing data protection and privacy rules, however, stands at odds with the requirement to achieve ever higher access bandwidths and processing rates in large data processing pipelines. In this work we describe our proposal for a path forward that reconciles the two goals. We call our approach "Software-Defined Data Protection" (SDP). Its premise is simple, yet powerful: decoupling often-changing policies from request-level enforcement allows distributed smart storage nodes to implement the latter at line rate. Existing and future data protection frameworks can be translated to the same hardware interface, which allows storage nodes to offload enforcement efficiently, both for company-specific rules and for regulations such as GDPR or CCPA. While SDP is a promising approach, several challenges remain to making this vision a reality. As we explain in the paper, overcoming these will require collaboration across several domains, including security, databases and specialized hardware design.

11:00 – 12:30 CEST Research Session 46: Graph: Community Search Chaired by Laurent Bindschaedler

Butterfly-Core Community Search over Labeled Graphs [Download Paper] (Baidu), (Hong Kong Baptist University), (Baidu), (Baidu), (Rutgers University) Community search aims at finding densely connected subgraphs for query vertices in a graph. While this task has been studied widely in the literature, most existing works focus only on finding homogeneous communities rather than heterogeneous communities with different labels. In this paper, we motivate a new problem of cross-group community search, namely Butterfly-Core Community (BCC), over a labeled graph, where each vertex has a label indicating its properties and an edge between two vertices indicates their cross relationship. Specifically, for two query vertices with different labels, we aim to find a densely connected cross community that contains the two query vertices and consists of butterfly networks, where each wing of the butterflies is induced by a k-core search based on one query vertex and the two wings are connected by these butterflies. We first develop a heuristic algorithm achieving a $2$-approximation to the optimal solution. Furthermore, we design fast techniques for query distance computation, leader pair identification, and index-based BCC local exploration. Extensive experiments on seven real datasets and four useful case studies validate the effectiveness and efficiency of our BCC and its multi-labeled extension models.
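The k-core primitive that induces each wing is standard; a minimal peeling sketch follows (the adjacency-dict graph `g` is a hypothetical example, not from the paper).

```python
from collections import deque

def k_core(adj, k):
    """Return the vertices of the k-core: the maximal subgraph in which every
    vertex has degree >= k, computed by repeatedly peeling low-degree vertices."""
    deg = {v: len(ns) for v, ns in adj.items()}
    queue = deque(v for v, d in deg.items() if d < k)
    removed = set()
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:                  # removing v lowers its neighbors' degrees
            if u not in removed:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    return set(adj) - removed

# Triangle {a, b, c} plus a pendant vertex d: the 2-core drops d.
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
assert k_core(g, 2) == {"a", "b", "c"}
```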

Efficient Size-Bounded Community Search over Large Networks [Download Paper] (The University of Sydney), (The University of Sydney) The problem of community search, which aims to find a cohesive subgraph containing user-given query vertices, has been extensively studied recently. Most of the existing studies mainly focus on the cohesiveness of the returned community, while ignoring the size of the community, and may yield communities of very large size. However, many applications naturally require that the number of vertices/members in a community should fall within a certain range. In this paper, we design exact algorithms for the general size-bounded community search problem that aims to find a subgraph with the largest min-degree among all connected subgraphs that contain the query vertex $q$ and have at least $\ell$ and at most $h$ vertices, where $q, \ell, h$ are specified by the query. As the problem is NP-hard, we propose a branch-reduce-and-bound algorithm SC-BRB by developing nontrivial reducing techniques, upper bounding techniques, and branching techniques. Experiments on large real graphs show that SC-BRB on average increases the minimum degree of the community returned by the state-of-the-art heuristic algorithm GreedyF by a factor of $2.41$ and increases the edge density by a factor of $2.2$. In addition, SC-BRB is several orders of magnitude faster than a baseline approach, and all of our proposed techniques contribute to the efficiency of SC-BRB.
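The problem definition can be made concrete with a brute-force reference implementation. This exponential sketch only illustrates the objective that SC-BRB optimizes; the paper's reducing, bounding, and branching techniques exist precisely to avoid this enumeration. The example graph is hypothetical.

```python
from itertools import combinations

def min_degree(g, S):
    return min(sum(1 for u in g[v] if u in S) for v in S)

def connected(g, S):
    seen, stack = set(), [next(iter(S))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(u for u in g[v] if u in S)
    return seen == S

def best_sized_community(g, q, lo, hi):
    """Maximize min-degree over connected subgraphs that contain q
    and have between lo and hi vertices (brute force)."""
    best, best_deg = None, -1
    for size in range(lo, hi + 1):
        for sub in combinations(g, size):
            S = set(sub)
            if q in S and connected(g, S) and min_degree(g, S) > best_deg:
                best, best_deg = S, min_degree(g, S)
    return best, best_deg

g = {"q": {"a", "b"}, "a": {"q", "b"}, "b": {"q", "a", "c"}, "c": {"b"}}
S, d = best_sized_community(g, "q", 3, 3)
assert S == {"q", "a", "b"} and d == 2   # the triangle beats any set containing c
```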

ICS-GNN: Lightweight Interactive Community Search via Graph Neural Network [Download Paper] (Peking University), (Peking University), (Alibaba Group), (The University of Southern Queensland) Searching for a community containing a given query vertex in an online social network enjoys wide applications such as recommendation and team organization. When applied to real-life networks, existing approaches face two major limitations. First, they usually take two steps, i.e., crawling a large part of the network first and then finding the community, but the entire network is usually too big and most of the data are not interesting to end users. Second, existing methods use hand-crafted rules to measure community membership, while it is very difficult to define effective rules because communities are flexible for different query vertices. In this paper, we propose an Interactive Community Search method based on a Graph Neural Network (abbreviated as ICS-GNN) to locate the target community over a subgraph collected on the fly from an online network. Specifically, we recast the community membership problem as a vertex classification problem using a GNN, which captures similarities between the graph vertices and the query vertex by combining content and structural features seamlessly and flexibly under the guidance of users' labeling. We then introduce a k-sized Maximum-GNN-scores (abbreviated as kMG) community to describe the target community. We discover the target community iteratively and interactively. In each iteration, we build a candidate subgraph from the crawled pages guided by the query vertex and labeled vertices, infer the vertex scores with a GNN model trained on the subgraph, and discover the kMG community, which is evaluated by end users to acquire more feedback. In addition, two optimization strategies are proposed that incorporate a ranking loss into the GNN model and explore a larger search space when locating the target community.
We conduct experiments on both offline and online real-life datasets, and demonstrate that ICS-GNN can produce effective communities with low overhead in communication, computation, and user labeling.

Scalable Community Detection via Parallel Correlation Clustering [Download Paper] (MIT), (MIT CSAIL), (Google), (Google), (Google) Graph clustering and community detection are central in modern data mining. The increasing need for analyzing billion-scale data sets calls for faster and more scalable algorithms. There are certain trade-offs between quality and speed of such clustering algorithms. In this paper, we aim to design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework that encompasses modularity and correlation clustering. The framework applies to the LambdaCC objective (introduced by Veldt et al.), which unifies several quality measures. Our framework consists of highly-optimized implementations that demonstrably scale to large data sets of up to billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines while maintaining or improving quality.
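Assuming the standard unweighted LambdaCC disagreement objective of Veldt et al. (an edge cut between clusters costs 1-λ, a missing edge kept inside a cluster costs λ; scaling conventions vary across papers), the quantity being optimized can be sketched as:

```python
from itertools import combinations

def lambda_cc_cost(nodes, edges, clusters, lam):
    """Unweighted LambdaCC disagreements: cut edges cost (1 - lam),
    intra-cluster non-edges cost lam. lam = 0.5 recovers (scaled)
    correlation clustering disagreements."""
    label = {v: i for i, c in enumerate(clusters) for v in c}
    edge_set = {frozenset(e) for e in edges}
    cost = 0.0
    for u, v in combinations(nodes, 2):
        same = label[u] == label[v]
        if frozenset((u, v)) in edge_set and not same:
            cost += 1 - lam          # edge cut between clusters
        elif frozenset((u, v)) not in edge_set and same:
            cost += lam              # missing edge kept inside a cluster
    return cost

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
# Keeping the triangle together and isolating d cuts only the edge (c, d).
assert lambda_cc_cost(nodes, edges, [{"a", "b", "c"}, {"d"}], 0.5) == 0.5
```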

Scalable Mining of Maximal Quasi-Cliques: An Algorithm-System Codesign Approach [Download Paper] (The University of Alabama at Birmingham), (The University of Alabama at Birmingham), (University of Waterloo), (University of Alabama), (The University of Alabama at Birmingham)

13:30 – 15:00 CEST Research Session 47: Transactions and Queries Chaired by Boris Novikov

Database Isolation By Scheduling [Download Paper] (University of Wisconsin-Madison), (UW Madison), (UW - Madison) Transaction isolation is conventionally achieved by restricting access to the physical items in a database. To maximize performance, isolation functionality is often packaged with recovery, I/O, and data access methods in a monolithic transactional storage manager. While this design has historically afforded high performance in online transaction processing systems, industry trends indicate a growing need for a new approach in which intertwined components of the transactional storage manager are disaggregated into modular services. This paper presents a new method to modularize the isolation component. Our work builds on predicate locking, an isolation mechanism that enables this modularization by locking logical rather than physical items in a database. Predicate locking is rarely used as the core isolation mechanism because of its high theoretical complexity and perceived overhead. However, we show that this overhead can be substantially reduced in practice by optimizing for common predicate structures. We present DIBS, a transaction scheduler that employs our predicate locking optimizations to guarantee isolation as a modular service. We evaluate the performance of DIBS as the sole isolation mechanism in a data processing system. In this setting, DIBS scales up to 10.5 million transactions per second on a TATP workload. We also explore how DIBS can be applied to existing database systems to increase transaction throughput. DIBS reduces per-transaction file system writes by 90% on TATP in SQLite, resulting in a 3X improvement in throughput. Finally, DIBS reduces row contention on YCSB in MySQL, providing serializable isolation with a 1.4X improvement in throughput.
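The core overlap test that predicate locking relies on can be sketched for simple key-range predicates. This is a hypothetical simplification for illustration, not DIBS's optimized implementation; table names, key ranges, and lock modes are made up.

```python
class PredicateLock:
    """A lock over all records of `table` whose key falls in [lo, hi]."""
    def __init__(self, table, lo, hi, mode):
        self.table, self.lo, self.hi, self.mode = table, lo, hi, mode

def conflicts(a, b):
    """Two lock requests conflict only if some logical record could
    satisfy both predicates and at least one request is a write."""
    if a.table != b.table:
        return False                       # different tables share no records
    if a.mode == "read" and b.mode == "read":
        return False                       # shared locks are compatible
    return a.lo <= b.hi and b.lo <= a.hi   # key ranges intersect

r = PredicateLock("accounts", 0, 100, "read")
w = PredicateLock("accounts", 50, 150, "write")
assert conflicts(r, w)                                          # overlapping ranges
assert not conflicts(r, PredicateLock("accounts", 200, 300, "write"))
assert not conflicts(w, PredicateLock("orders", 50, 150, "write"))
```

Because the check runs on logical predicates rather than physical items, it can be offered as a standalone scheduling service, which is the modularity argument the paper makes.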

Epoch-based Commit and Replication in Distributed OLTP Databases [Download Paper] (MIT), (University of Wisconsin-Madison), (MIT), (MIT) Many modern data-oriented applications are built on top of distributed OLTP databases for both scalability and high availability. Such distributed databases enforce atomicity, durability, and consistency through two-phase commit (2PC) and synchronous replication at the granularity of every single transaction. In this paper, we present COCO, a new distributed OLTP database that supports epoch-based commit and replication. The key idea behind COCO is that it separates transactions into epochs and treats a whole epoch of transactions as the commit unit. In this way, the overhead of 2PC and synchronous replication is significantly reduced. We support two variants of optimistic concurrency control (OCC) using physical time and logical time with various optimizations, which are enabled by the epoch-based execution. Our evaluation on two popular benchmarks (YCSB and TPC-C) shows that COCO outperforms systems with fine-grained 2PC and synchronous replication by up to a factor of four.
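The amortization argument behind epoch-based commit can be sketched as follows. `EpochCommitLog` is a hypothetical toy, not COCO's protocol; the counted "sync" stands in for an expensive 2PC/replication round that is now paid once per epoch instead of once per transaction.

```python
class EpochCommitLog:
    def __init__(self, epoch_size):
        self.epoch_size = epoch_size
        self.pending = []
        self.syncs = 0          # expensive 2PC/replication rounds performed
        self.committed = []

    def submit(self, txn):
        self.pending.append(txn)
        if len(self.pending) >= self.epoch_size:
            self.close_epoch()

    def close_epoch(self):
        """Commit the whole epoch with one durability/replication round."""
        if self.pending:
            self.syncs += 1
            self.committed.extend(self.pending)
            self.pending = []

log = EpochCommitLog(epoch_size=100)
for i in range(1000):
    log.submit(f"txn-{i}")
assert log.syncs == 10 and len(log.committed) == 1000   # 10 rounds, not 1000
```

The trade-off, as with group commit generally, is that a transaction's durability is only guaranteed once its epoch closes.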

Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats [Download Paper] (Massachusetts Institute of Technology), (Carnegie Mellon University), (Carnegie Mellon University), (Carnegie Mellon University), (Ursa Labs), (Carnegie Mellon University) The proliferation of modern data processing tools has given rise to open-source columnar data formats. These formats help organizations avoid repeated conversion of data to a new format for each application. However, these formats are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. As a result, DBMSs often fail to take advantage of full network bandwidth when transferring data. We aim to reduce or even eliminate this overhead by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. We implemented our storage engine based on the Apache Arrow format and integrated it into the NoisePage DBMS to evaluate our work. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.

Permutable Compiled Queries: Dynamically Adapting Compiled Queries without Recompiling [Download Paper] (Carnegie Mellon University), (Carnegie Mellon University), (Carnegie Mellon University), (Carnegie Mellon University), (Carnegie Mellon University) Just-in-time (JIT) query compilation is a common technique to improve analytical query performance in database management systems (DBMSs). However, the cost of compiling each query can be significant relative to its execution time. This overhead prohibits the DBMS from employing well-known adaptive query processing (AQP) methods to generate a new plan for a query if data distributions do not match the optimizer's estimations. The optimizer could eagerly generate multiple sub-plans for a query, but it can only include a few alternatives as each addition increases the query compilation time. We present a new method, called Permutable Compiled Queries (PCQ), that bridges the gap between JIT compilation and AQP. It allows the DBMS to modify compiled queries without needing to recompile the plan or to include all possible variations before the query starts. With PCQ, the DBMS structures a query's code to include specialized data structures in the logic that facilitate dynamic changes. These structures provide an indirection layer that enables the DBMS to change the plan even while it is running. We implement PCQ in an in-memory DBMS and compare it against non-adaptive plans in a microbenchmark and against state-of-the-art analytic DBMSs. Our evaluation shows that PCQ outperforms static plans by more than 4x and yields better performance on an analytical benchmark by more than 2x against other DBMSs.

13:30 – 15:00 CEST Research Session 48: Differential Privacy II Chaired by Divesh Srivastava

Real-World Trajectory Sharing with Local Differential Privacy [Download Paper] (University of Warwick), (University of Warwick), (University of Warwick), (AT&T Labs Research) Sharing trajectories is beneficial for many real-world applications, such as managing disease spread through contact tracing and tailoring public services to a population's travel patterns. However, public concern over privacy and data protection has limited the extent to which this data is shared. Local differential privacy enables data sharing in which users share a perturbed version of their data, but existing mechanisms fail to incorporate user-independent public knowledge (e.g., business locations and opening times, public transport schedules, geo-located tweets). This limitation makes mechanisms too restrictive, gives unrealistic outputs, and ultimately leads to low practical utility. To address these concerns, we propose a local differentially private mechanism that is based on perturbing hierarchically-structured, overlapping n-grams (i.e., contiguous subsequences of length n) of trajectory data. Our mechanism uses a multi-dimensional hierarchy over publicly available external knowledge of real-world places of interest to improve the realism and utility of the perturbed, shared trajectories. Importantly, including real-world public data does not negatively affect privacy or efficiency. Our experiments, using real-world data and a range of queries, each with real-world application analogues, demonstrate the superiority of our approach over a range of competing methods.
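Extracting the overlapping n-grams that the mechanism perturbs is straightforward; a minimal sketch (the example trajectory is hypothetical):

```python
def overlapping_ngrams(trajectory, n):
    """All contiguous length-n subsequences of a trajectory; consecutive
    n-grams overlap in n - 1 positions."""
    return [tuple(trajectory[i:i + n]) for i in range(len(trajectory) - n + 1)]

traj = ["home", "cafe", "park", "office"]
assert overlapping_ngrams(traj, 2) == [
    ("home", "cafe"), ("cafe", "park"), ("park", "office")]
```

In the paper's mechanism, each n-gram (rather than each full trajectory) is what gets perturbed, with the public place-of-interest hierarchy constraining the perturbation to realistic outputs.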

Budget Sharing for Multi-Analyst Differential Privacy [Download Paper] (Duke University), (Duke University), (Duke University), (Duke) Large organizations that collect data about populations (like the US Census Bureau) release summary statistics that are used by multiple stakeholders for resource allocation and policy making problems. These organizations are also legally required to protect the privacy of individuals from whom they collect data. Differential Privacy (DP) provides a solution to release useful summary data while preserving privacy. However, most DP mechanisms are designed to answer a single set of queries and optimize the total accuracy. In reality, there are often multiple stakeholders that use a given data release and have overlapping but not-identical queries. This introduces a novel joint optimization problem in DP where the privacy budget must be shared among different analysts. In this work, we initiate study into the problem of DP query answering across multiple analysts. To capture the competing goals and priorities of multiple analysts, we formulate three desiderata that any mechanism should satisfy in this setting -- The Sharing Incentive, Non-Interference, and Workload Adaptivity -- while still optimizing for overall error. We demonstrate how existing DP query answering mechanisms in the multi-analyst settings fail to satisfy at least one of the desiderata. We present novel DP algorithms that provably satisfy all our desiderata and empirically show that they incur low error on realistic tasks.

Improving Utility and Security of the Shuffler-based Differential Privacy [Download Paper] (Purdue University), (Data Analytics and Intelligence Lab, Alibaba Group), (University of Chicago), (Alibaba Group), (Alibaba Group), (Alibaba Group), (Purdue University), (University of Wisconsin-Madison and XaiPient) When collecting information, local differential privacy (LDP) alleviates the privacy concerns of users because their private information is randomized before being sent to the central aggregator. LDP imposes a large amount of noise, as each user executes the randomization independently. To address this issue, recent work introduced an intermediate server, with the assumption that this intermediate server does not collude with the aggregator. Under this assumption, less noise can be added to achieve the same privacy guarantee as LDP, thus improving utility for the data collection task. This paper investigates this multiple-party setting of LDP. We analyze the system model and identify potential adversaries. We then make two improvements: a new algorithm that achieves a better privacy-utility tradeoff, and a novel protocol that provides better protection against various attacks. Finally, we perform experiments to compare different methods and demonstrate the benefits of using our proposed method.

Local Dampening: Differential Privacy for Non-numeric Queries via Local Sensitivity [Download Paper] (Universidade Federal do Ceara), (LSBD/UFC), (AT&T Labs Research), (LSBD/UFC), (AT&T Labs Research), (AT&T Labs Research) Differential privacy is the state-of-the-art formal definition for data release under strong privacy guarantees. A variety of mechanisms have been proposed in the literature for releasing the noisy output of numeric queries (e.g., using the Laplace mechanism), based on the notions of global sensitivity and local sensitivity. However, although there has been some work on generic mechanisms for releasing the output of non-numeric queries using global sensitivity (e.g., the Exponential mechanism), the literature lacks generic mechanisms for releasing the output of non-numeric queries using local sensitivity to reduce the noise in the query output. In this work, we remedy this shortcoming and present the local dampening mechanism. We adapt the notion of local sensitivity for the non-numeric setting and leverage it to design a generic non-numeric mechanism. We illustrate the effectiveness of the local dampening mechanism by applying it to two diverse problems: (i) Influential node analysis. Given an influence metric, we release the top-k most influential nodes while preserving the privacy of the relationship between nodes in the network; (ii) Decision tree induction. We provide a private adaptation to the ID3 algorithm to build decision trees from a given tabular dataset. Experimental results show that we can reduce privacy budget consumption by 3 to 4 orders of magnitude for influential node analysis and increase accuracy by up to 12% for decision tree induction when compared to global-sensitivity-based approaches.
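For context, the global-sensitivity baseline for non-numeric outputs is the classic exponential mechanism, which samples a candidate with probability proportional to exp(ε·u(r)/(2Δu)). A minimal sketch follows; the node-degree utility, candidate set, and sensitivity value are illustrative assumptions, not taken from the paper.

```python
import math
import random

def exponential_mechanism(candidates, utility, eps, sensitivity):
    """Sample r with probability proportional to exp(eps * u(r) / (2 * Delta_u))."""
    weights = [math.exp(eps * utility(r) / (2 * sensitivity)) for r in candidates]
    total = sum(weights)
    x = random.random() * total
    for r, w in zip(candidates, weights):   # weighted sampling by cumulative sum
        x -= w
        if x <= 0:
            return r
    return candidates[-1]

random.seed(0)
degrees = {"a": 9, "b": 1, "c": 0}          # toy utility: node degree
picks = [exponential_mechanism(list(degrees), degrees.get, eps=2.0, sensitivity=1)
         for _ in range(1000)]
assert picks.count("a") > picks.count("c")  # higher utility is favored
```

Local dampening's point is that calibrating to a local rather than global sensitivity lets the weights spread less, wasting less budget on the same candidate gap.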

Frequency Estimation under Local Differential Privacy [Download Paper] (University of Warwick), (University of Warwick), (University of Warwick) Private collection of statistics from a large distributed population is an important problem, and has led to large scale deployments from several leading technology companies. The dominant approach requires each user to randomly perturb their input, leading to guarantees in the local differential privacy model. In this paper, we place the various approaches that have been suggested into a common framework, and perform an extensive series of experiments to understand the tradeoffs between different implementation choices. Our conclusion is that for the core problems of frequency estimation and heavy hitter identification, careful choice of algorithms can lead to very effective solutions that scale to millions of users.
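One of the simplest mechanisms in this design space is k-ary randomized response with an unbiased frequency correction; a minimal sketch (a generic textbook construction, not any specific deployment's implementation; the domain and counts are made up):

```python
import math
import random

def krr_perturb(value, domain, eps):
    """k-ary randomized response: report the true value with probability
    p = e^eps / (e^eps + k - 1), otherwise a uniformly chosen other value."""
    k = len(domain)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if random.random() < p:
        return value
    return random.choice([v for v in domain if v != value])

def estimate_frequencies(reports, domain, eps):
    """Invert the perturbation in expectation to get unbiased estimates."""
    k, n = len(domain), len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1 - p) / (k - 1)
    counts = {v: 0 for v in domain}
    for r in reports:
        counts[r] += 1
    # E[count_v] = n*q + true_v*(p - q), so solve for true_v.
    return {v: (c - n * q) / (p - q) for v, c in counts.items()}

random.seed(1)
domain = ["A", "B", "C"]
truth = ["A"] * 600 + ["B"] * 300 + ["C"] * 100
reports = [krr_perturb(v, domain, eps=2.0) for v in truth]
est = estimate_frequencies(reports, domain, eps=2.0)
assert est["A"] > est["B"] > est["C"]   # the ordering survives the noise
```

The paper's framework compares this family against hash- and sketch-based encodings, which dominate when the domain is large.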

13:30 – 15:00 CEST Research Session 49: Distributed Machine Learning Chaired by Matthias Boehm

Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches [Download Paper] (University of California, San Diego), (VMware), (Intuit), (VMware), (VMware), (VMware), (VMware), (University of California, San Diego) Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems, but to bring MOP-based DL to DB-resident data, we show that there is no single "best" approach, and an interesting tradeoff space of approaches exists. We explain four canonical approaches and build prototypes upon Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance) and compare them empirically with large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and there is a Pareto frontier; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.

Distributed Numerical and Machine Learning Computations via Two-Phase Execution of Aggregated Join Trees [Download Paper] (Rice University), (Rice University), (Rice University), (Rice University) When numerical and machine learning (ML) computations are expressed relationally, classical query execution strategies (hash-based joins and aggregations) can do a poor job distributing the computation. In this paper, we propose a two-phase execution strategy for numerical computations that are expressed relationally, as aggregated join trees (that is, expressed as a series of relational joins followed by an aggregation). In a pilot run, lineage information is collected; this lineage is used to optimally plan the computation at the level of individual records. Then, the computation is actually executed. We show experimentally that a relational system making use of this two-phase strategy can be an excellent platform for distributed ML computations.

Tensor Relational Algebra for Distributed Machine Learning System Design [Download Paper] (Rice University), (Rice University), (Arizona State University), (Rice University), (Rice University), (Rice University) We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or tensors that do not easily fit into the RAM of an ASIC. In this paper, we present an alternative implementation abstraction called the tensor relational algebra (TRA). The TRA is a set-based algebra based on the relational algebra. Expressions in the TRA operate over binary tensor relations, where keys are multi-dimensional arrays and values are tensors. The TRA is easily executed with high efficiency in a parallel or distributed environment, and amenable to automatic optimization. Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML workflows in distributed clusters.
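Our reading of the abstraction, as a hedged sketch rather than the paper's API: a tensor relation maps multi-dimensional keys to tensor blocks, so a blocked matrix multiply becomes a join on the shared key dimension followed by an aggregation over it.

```python
from collections import defaultdict

def block_matmul(A, B):
    """A: {(i, k): block}, B: {(k, j): block}, blocks as 2x2 nested lists.
    Join on the shared index k, multiply matching blocks, sum into (i, j)."""
    def mul(x, y):
        return [[sum(x[r][t] * y[t][c] for t in range(2)) for c in range(2)]
                for r in range(2)]
    def add(x, y):
        return [[x[r][c] + y[r][c] for c in range(2)] for r in range(2)]

    out = defaultdict(lambda: [[0, 0], [0, 0]])
    for (i, k1), a in A.items():              # relational join on k
        for (k2, j), b in B.items():
            if k1 == k2:
                out[(i, j)] = add(out[(i, j)], mul(a, b))   # aggregation
    return dict(out)

I = [[1, 0], [0, 1]]
A = {(0, 0): I}
B = {(0, 0): [[2, 3], [4, 5]]}
assert block_matmul(A, B) == {(0, 0): [[2, 3], [4, 5]]}
```

Because the outer computation is ordinary join-plus-aggregate over small keys, it inherits relational distribution and optimization machinery, which is the paper's argument.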

ParaX: Boosting Deep Learning for Big Data Analytics on Many-Core CPUs [Download Paper] (NUDT), (NUDT), (NUDT), (NUDT), (Intel) Although GPUs and accelerators are more efficient in deep learning (DL), commercial clouds now heavily use CPUs in DL computation because large numbers of CPUs would otherwise sit idle during off-peak periods. Following this trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current platforms cannot scale well to a large number of CPU cores, making many-core CPUs inefficient in DL computation. We analyze the memory access patterns of various layers and identify the root cause of the low scalability, i.e., the per-layer barriers that are implicitly imposed by current platforms, which assign a single instance (i.e., one batch of input data) to a CPU. The barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers. This paper presents a novel approach called ParaX, which boosts the performance of DL on many-core CPUs by effectively alleviating bandwidth contention and CPU starvation. Our key idea is to assign one instance to each CPU core instead of to the entire CPU, so as to remove the per-layer barriers on the executions of the many cores. ParaX designs an ultra-light scheduling policy which sufficiently overlaps the access-intensive layers with the compute-intensive ones to avoid contention, and proposes a NUMA-aware gradient server mechanism for training which leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. Extensive evaluation shows that ParaX achieves significant acceleration for DL on many-core CPUs.

Declarative Data Serving: The Future of Machine Learning Inference on the Edge [Download Paper] (University of Chicago), (Intel), (Intel), (U Chicago) Recent advances in computer architecture and networking have ushered in a new age of edge computing, where computation is placed close to the point of data collection to facilitate low-latency decision making. As the complexity of such deployments grows into networks of interconnected edge devices, getting the necessary data to be in "the right place at the right time" can become a challenge. We envision a future of edge analytics where data flows between edge nodes are declaratively configured through high-level constraints. Using machine learning model-serving as a prototypical task, we illustrate how the heterogeneity and specialization of edge devices can lead to complex, task-specific communication patterns even in relatively simple situations. Without a declarative framework, managing this complexity will be challenging for developers and will lead to brittle systems. We conclude with a research vision for the database community that brings our perspective to the emergent area of edge computing.

13:30 – 15:00 CEST Research Session 50: Distributed Analytics Chaired by Haralampos Gavriilidis

Parallel Discrepancy Detection and Incremental Detection [Download Paper] (Univ. of Edinburgh), (Alibaba Group), (University of Edinburgh), (Alibaba Group) This paper studies how to catch duplicates, mismatches and conflicts in the same process. We adopt a class of entity enhancing rules that embed machine learning predicates, unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We detect discrepancies as violations of such rules. We establish the complexity of discrepancy detection and incremental detection problems with the rules; they are both NP-complete and W[1]-hard. To cope with the intractability and scale with large datasets, we develop parallel algorithms and parallel incremental algorithms for discrepancy detection. We show that both algorithms are parallelly scalable, i.e., they guarantee to reduce runtime when more processors are used. Moreover, the parallel incremental algorithm is relatively bounded. The complexity bounds and algorithms carry over to denial constraints, a special case of the entity enhancing rules. Using real-life and synthetic datasets, we experimentally verify the effectiveness, scalability and efficiency of the algorithms.
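To make the rule-based view of discrepancies concrete, here is a hedged, deliberately simplified sketch of detecting violations of a single denial constraint (no two records may agree on an identifying key while disagreeing on an attribute); the paper's actual rules additionally embed ML predicates and run in parallel, which this sketch omits. The field names are illustrative, not from the paper.

```python
# Simplified sketch: discrepancies as denial-constraint violations.
# Constraint: no two records share the same email (an entity match)
# while carrying different names (a conflict).
from itertools import combinations

def detect_discrepancies(records, key="email", attr="name"):
    """Return all pairs of records violating the denial constraint."""
    violations = []
    for r, s in combinations(records, 2):
        if r[key] == s[key] and r[attr] != s[attr]:
            violations.append((r, s))
    return violations

records = [
    {"email": "ann@x.org", "name": "Ann"},
    {"email": "ann@x.org", "name": "Anne"},   # conflicts with the first
    {"email": "bob@x.org", "name": "Bob"},
]
print(detect_discrepancies(records))  # one violating pair
```

A parallel variant would partition the pair space across workers; the incremental variant re-checks only pairs touched by updated tuples.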

Space- and Computationally-Efficient Set Reconciliation via Parity Bitmap Sketch (PBS) [Download Paper] (Facebook), (Peking University), (Georgia Institute of Technology), (Georgia Tech), (University of Miami), (Peking University) Set reconciliation is a fundamental algorithmic problem that arises in many networking, system, and database applications. In this problem, two large sets A and B of objects (bitcoins, files, records, etc.) are stored at two different network-connected hosts, which we name Alice and Bob respectively. Alice and Bob communicate with each other to learn $A\Delta B$, the difference between A and B, and as a result the reconciled set $A\bigcup B$. Current set reconciliation schemes are based on either Invertible Bloom Filters (IBF) or Error-Correction Codes (ECC). The former has a low computational complexity of O(d), where d is the cardinality of $A\Delta B$, but has a high communication overhead that is several times larger than the theoretical minimum. The latter has a low communication overhead close to the theoretical minimum, but has a much higher computational complexity of $O(d^2)$. In this work, we propose Parity Bitmap Sketch (PBS), an ECC-based set reconciliation scheme that gets the better of both worlds: PBS has both a low computational complexity of O(d) just like IBF-based solutions and a low communication overhead of roughly twice the theoretical minimum. A separate contribution of this work is a novel rigorous analytical framework that can be used for the precise calculation of various performance metrics and for the near-optimal parameter tuning of PBS.
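The parity idea at the heart of PBS can be illustrated with a hedged, stripped-down sketch (no ECC layer, no parameter tuning): each host folds its set into a small bitmap where each element flips one hash-selected bit, so XOR-ing the two bitmaps marks the buckets holding an odd number of differing elements, which are the only buckets the hosts need to reconcile further.

```python
# Simplified parity-bitmap sketch: not the full PBS protocol, just the
# core observation that XOR of parity bitmaps localizes A Δ B.
import hashlib

M = 64  # bitmap size; ample when the symmetric difference is small

def parity_bitmap(elems, m=M):
    bits = [0] * m
    for e in elems:
        h = int(hashlib.sha256(str(e).encode()).hexdigest(), 16) % m
        bits[h] ^= 1  # parity flips on every insertion into a bucket
    return bits

A = {"tx1", "tx2", "tx3"}   # Alice's set
B = {"tx2", "tx3", "tx4"}   # Bob's set
diff_bits = [a ^ b for a, b in zip(parity_bitmap(A), parity_bitmap(B))]
# Buckets with a 1 bit hold an odd number of differing elements; their
# count is at most |A Δ B| = 2 (hash collisions can only cancel pairs).
print(sum(diff_bits))
```

The real scheme recovers the actual differing elements from the marked buckets via a BCH-style error-correction decoding step, which this sketch omits.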

Elle: Inferring Isolation Anomalies from Experimental Observations [Download Paper] (UC Santa Cruz), (Jepsen) Users who care about their data store it in databases, which (at least in principle) guarantee some form of transactional isolation. However, experience shows that many databases do not provide the isolation guarantees they claim. With the recent proliferation of new distributed databases, demand has grown for checkers that can, by generating client workloads and injecting faults, produce anomalies that witness a violation of a stated guarantee. An ideal checker would be sound (no false positives), efficient (polynomial in history length and concurrency), effective (finding violations in real databases), general (analyzing many patterns of transactions), and informative (justifying the presence of an anomaly with understandable counterexamples). Sadly, we are aware of no checkers that satisfy these goals. We present Elle: a novel checker which infers an Adya-style dependency graph between client-observed transactions. It does so by carefully selecting database objects and operations when generating histories, so as to ensure that the results of database reads reveal information about their version history. Elle can detect every anomaly in Adya et al.'s formalism (except for predicates), discriminate between them, and provide concise explanations of each. This paper makes the following contributions: we present Elle, demonstrate its soundness over specific datatypes, measure its efficiency against the current state of the art, and give evidence of its effectiveness via a case study of four real databases.
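The final step of such a checker, once a dependency graph between observed transactions has been inferred, can be sketched as plain cycle detection; this is a hedged illustration of the general idea, not Elle's inference machinery (which derives the edges from carefully chosen reads in the first place).

```python
# Sketch: a dependency cycle among observed transactions witnesses an
# isolation anomaly (no serial order can satisfy all edges).
# Edge T1 -> T2 means T2 must follow T1, e.g. T2 observed T1's write.
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    def dfs(v):
        color[v] = GRAY
        for w in graph.get(v, []):
            if color[w] == GRAY:
                return True           # back edge: a dependency cycle
            if color[w] == WHITE and dfs(w):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and dfs(v) for v in graph)

# Each transaction observed the other's write: no serial order exists.
deps = {"T1": ["T2"], "T2": ["T1"]}
print(has_cycle(deps))  # True
```

Elle goes further by classifying which Adya-style anomaly the cycle represents and emitting a human-readable counterexample.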

Language-Agnostic Integrated Queries in a Managed Polyglot Runtime [Download Paper] (Università della Svizzera italiana (USI)), (Oracle Labs), (Università della Svizzera italiana (USI)) Language-integrated query (LINQ) frameworks offer a convenient programming abstraction for processing in-memory collections of data, allowing developers to concisely express declarative queries using general-purpose programming languages. Existing LINQ frameworks rely on the well-defined type system of statically-typed languages such as C♯ or Java to perform query compilation and execution. As a consequence of this design, they do not support dynamic languages such as Python, R, or JavaScript. Such languages are however very popular among data scientists, who would certainly benefit from LINQ frameworks in data analytics applications. In this work we bridge the gap between dynamic languages and LINQ frameworks. We introduce DynQ, a novel query engine designed for dynamic languages. DynQ is language-agnostic, since it is able to execute SQL queries in a polyglot language runtime. Moreover, DynQ can execute queries combining data from multiple sources, namely in-memory object collections as well as on-file data and external database systems. Our evaluation of DynQ shows performance comparable with equivalent hand-optimized code, and in line with common data-processing libraries and embedded databases, making DynQ an appealing query engine for standalone analytics applications and for data-intensive server-side workloads.

Lachesis: Automatic Partitioning for UDF-Centric Analytics [Download Paper] (Arizona State University), (Arizona State University), (Arizona State University), (IBM T.J. Watson Research Center), (Rice University), (Rice University), (Rice University) Partitioning is effective in avoiding expensive shuffling operations. However, it remains a significant challenge to automate this process for Big Data analytics workloads that extensively use user-defined functions (UDFs), where sub-computations are harder to reuse for partitioning than in relational applications. In addition, functional dependency, which is widely utilized for partitioning selection, is often unavailable in the unstructured data that is ubiquitous in UDF-centric analytics. We propose the Lachesis system, which represents UDF-centric workloads as workflows of analyzable and reusable sub-computations. Lachesis further adopts a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then applied to automatically optimize the storage of the data across applications to improve the performance and users' productivity.

13:30 – 15:00 CEST Research Session 51: Query Execution Chaired by Steffen Zeuch

Beyond Equi-joins: Ranking, Enumeration and Factorization [Download Paper] (Northeastern University), (Northeastern University), (Northeastern University) We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time. This is within a polylogarithmic factor of the best known complexity for equi-joins and even of O(n+k), the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections, such as the so-called free-connex queries. Remarkably, they hold even when the entire output is of size n^ℓ for a join of ℓ relations. The key ingredient is a novel O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.
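The flavor of ranked enumeration can be conveyed with a hedged, minimal sketch: emit join results of two weight-sorted relations in increasing total weight using a heap-maintained frontier, so the cost tracks the k answers actually requested rather than the full output. This is an illustration of the problem setting, not the paper's factorized O(n polylog n) construction, and it handles only a two-relation equi-free cross product.

```python
import heapq
from itertools import islice

def ranked_pairs(xs, ys):
    """xs, ys: non-empty lists of (weight, value), sorted by weight.
    Yields (total_weight, x_value, y_value) in increasing total weight."""
    heap = [(xs[0][0] + ys[0][0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        w, i, j = heapq.heappop(heap)
        yield w, xs[i][1], ys[j][1]
        for ni, nj in ((i + 1, j), (i, j + 1)):   # expand the frontier
            if ni < len(xs) and nj < len(ys) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (xs[ni][0] + ys[nj][0], ni, nj))

R = [(1, "a"), (3, "b")]
S = [(2, "x"), (5, "y")]
top2 = [w for w, _, _ in islice(ranked_pairs(R, S), 2)]
print(top2)  # [3, 5]
```

The paper's contribution is achieving comparable incremental guarantees for inequality join predicates over many relations via an on-the-fly factorized representation.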

A Practical Approach to Groupjoin and Nested Aggregates [Download Paper] (TUM), (TUM) Groupjoins, the combined execution of a join and a subsequent group by, are common in analytical queries, and occur in about 1/8 of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention, which limits their usefulness in a many-core system. Having an efficient implementation of groupjoins is highly desirable, as groupjoins are not only used to fuse group by and join but are also introduced by the unnesting component of the query optimizer to avoid nested-loops evaluation of aggregates. Furthermore, the query optimizer needs to be able to reason over the result of aggregation in order to schedule it correctly. Traditional selectivity and cardinality estimations quickly reach their limits when faced with computed columns from nested aggregates, which leads to poor cost estimations and thus, suboptimal query plans. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose two novel techniques, aggregate estimates to predict the result distribution of aggregates, and parallel groupjoin execution for a scalable execution of groupjoins. The resulting system has significantly better estimates and a contention-free evaluation of groupjoins, which can speed up some TPC-H queries up to a factor of 2.
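The fusion a groupjoin performs can be sketched in a few lines, hedged as a single-threaded illustration (the paper's contribution is the contention-free parallel version and the estimation machinery): the join hash table doubles as the aggregation table, so probe-side matches are aggregated per build key without materializing the join result.

```python
# Sketch of a groupjoin: one hash table serves both the join and the
# group-by. Equivalent to
#   SELECT b.key, SUM(p.val) FROM build b JOIN probe p USING (key)
#   GROUP BY b.key
def groupjoin_sum(build, probe, agg_col):
    ht = {row["key"]: 0 for row in build}   # one aggregate slot per group
    for row in probe:
        if row["key"] in ht:                # join and aggregate in one pass
            ht[row["key"]] += row[agg_col]
    return ht

customers = [{"key": 1}, {"key": 2}]
orders = [{"key": 1, "val": 10}, {"key": 1, "val": 5}, {"key": 3, "val": 7}]
print(groupjoin_sum(customers, orders, "val"))  # {1: 15, 2: 0}
```

Under parallel execution, many probe threads updating the same aggregate slot is exactly the contention the paper's technique is designed to avoid.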

Stacked Filters: Learning to Filter by Structure [Download Paper] (Harvard University), (Harvard University), (Harvard) We present Stacked Filters, a new probabilistic filter which is fast and robust similar to query-agnostic filters (such as Bloom and Cuckoo filters), and at the same time brings low false positive rates and sizes similar to classifier-based filters (such as Learned Filters). The core idea is that Stacked Filters incorporate workload knowledge about frequently queried non-existing values. Instead of learning, they structurally incorporate that knowledge using hashing and several sequenced filter layers, indexing both data and frequent negatives. Stacked Filters can also gather workload knowledge on-the-fly and adaptively build the filter. We show experimentally that for a given memory budget, Stacked Filters achieve end-to-end query throughput up to 130x better than the best alternative for a workload, whether query-agnostic or classifier-based, depending on where the data resides (SSD or HDD).
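The layered structure can be sketched with three tiny Bloom filters, hedged as a toy reconstruction of the idea rather than the paper's tuned design: layer 1 indexes the positives, layer 2 indexes frequent negatives that slip through layer 1, and layer 3 re-admits any positives that collide with layer 2, so true positives are never rejected.

```python
import hashlib

class Bloom:
    """Minimal Bloom filter over an integer bitset (illustrative only)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _idx(self, x):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.m
    def add(self, x):
        for b in self._idx(x):
            self.bits |= 1 << b
    def __contains__(self, x):
        return all(self.bits >> b & 1 for b in self._idx(x))

def build_stacked(positives, frequent_negatives):
    l1, l2, l3 = Bloom(), Bloom(), Bloom()
    for p in positives:
        l1.add(p)
    for n in frequent_negatives:
        if n in l1:
            l2.add(n)       # only negatives that would be false positives
    for p in positives:
        if p in l2:
            l3.add(p)       # rescue positives that collide with layer 2
    return l1, l2, l3

def query(stack, x):
    l1, l2, l3 = stack
    return x in l1 and (x not in l2 or x in l3)

stack = build_stacked({"k1", "k2"}, {"bad1", "bad2"})
print(query(stack, "k1"))   # True: positives are never rejected
print(query(stack, "bad1")) # a frequent negative is (almost surely) filtered
```

The layering is what buys the low false positive rate on frequent negatives without any learned classifier.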

Charting the Design Space of Query Execution using VOILA [Download Paper] (CWI), (CWI) Database architecture, while having been studied for four decades now, has delivered only a few designs with well-understood properties. These few are followed by most actual systems. Acquiring more knowledge about the design space is a very time-consuming process that requires manually crafting prototypes with a low chance of generating material insight. We propose a framework that aims to accelerate this exploration process significantly. Our framework enables synthesizing many different engines from a description in a carefully designed domain-specific language (VOILA). We explain basic concepts and formally define the semantics of VOILA. We demonstrate VOILA's flexibility by presenting translation back-ends that allow the synthesis of state-of-the-art paradigms (data-centric compilation, vectorized execution, AVX-512), mutations and mixes thereof. We showcase VOILA's flexibility by exploring the query engine design space in an automated fashion. We generated thousands of query engines and report our findings. Queries generated by VOILA achieve similar performance as state-of-the-art hand-optimized implementations and are up to 35.5x faster than well-known systems.
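The two paradigms such a framework synthesizes from one description can be contrasted with a hedged toy example (pure Python, so it shows the control-flow shapes, not the performance characteristics): tuple-at-a-time evaluation in the data-centric style versus vectorized execution over column chunks with a selection vector.

```python
# Two execution shapes for the same query: SELECT SUM(v) WHERE v > t.
def sum_where_tuple_at_a_time(col, threshold):
    total = 0
    for v in col:                # one value flows through per iteration
        if v > threshold:
            total += v
    return total

def sum_where_vectorized(col, threshold, chunk=1024):
    total = 0
    for i in range(0, len(col), chunk):
        vec = col[i:i + chunk]               # operate on a chunk at once
        mask = [v > threshold for v in vec]  # selection vector
        total += sum(v for v, m in zip(vec, mask) if m)
    return total

data = list(range(10))
assert sum_where_tuple_at_a_time(data, 4) == sum_where_vectorized(data, 4) == 35
```

VOILA's point is that one declarative description can be lowered to either shape (or AVX-512 variants and hybrids) by swapping the translation back-end.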

13:30 – 15:00 CEST Research Session 52: Graph Processing II Chaired by Riccardo Tommasini

Columnar Storage and List-based Processing for Graph Database Management Systems [Download Paper] (University of Waterloo), (University of Waterloo), (University of Waterloo) We revisit column-oriented storage and query processing techniques in the context of contemporary graph database management systems (GDBMSs). Similar to column-oriented RDBMSs, GDBMSs support read-heavy analytical workloads that however have fundamentally different data access patterns than traditional analytical workloads. We first derive a set of desiderata for optimizing storage and query processors of GDBMS based on their access patterns. We then present the design of columnar storage, compression, and query processing techniques based on these desiderata. In addition to showing direct integration of existing techniques from columnar RDBMSs, we also propose novel ones that are optimized for GDBMSs. These include a novel list-based query processor, which avoids expensive data copies of traditional block-based processors under many-to-many joins and avoids materializing adjacency lists in intermediate tuples, a new data structure we call single-indexed edge property pages and an accompanying edge ID scheme, and a new application of Jacobson’s bit vector index for compressing NULL values and empty lists. We integrated our techniques into the GraphflowDB in-memory GDBMS. Through extensive experiments, we demonstrate the scalability and query performance benefits of our techniques.
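As background for the columnar designs above, here is a hedged sketch of the compressed sparse row (CSR) layout that column-oriented adjacency storage builds on: an offsets column indexes into one flat neighbour column, so a vertex's forward adjacency list is a contiguous slice with no pointer chasing. This illustrates the baseline layout, not the paper's single-indexed edge property pages.

```python
# Minimal CSR adjacency layout for a directed graph.
def build_csr(num_vertices, edges):
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count out-degrees
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix-sum into offsets
    neighbours = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        neighbours[cursor[src]] = dst  # scatter into the flat column
        cursor[src] += 1
    return offsets, neighbours

def neighbours_of(csr, v):
    offsets, neighbours = csr
    return neighbours[offsets[v]:offsets[v + 1]]   # contiguous slice

csr = build_csr(3, [(0, 1), (0, 2), (1, 2)])
print(neighbours_of(csr, 0))  # [1, 2]
```

The paper's list-based processor passes such slices through operators directly instead of copying them into block-oriented intermediate tuples.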

On Querying Historical K-Cores [Download Paper] (UNSW), (University of Technology Sydney), (UTS), (University of Technology Sydney), (University of New South Wales), (University of New South Wales) Many real-world relationships between entities can be modeled as temporal graphs, where each edge is associated with a timestamp or a time interval representing its occurrence. K-core is a fundamental model used to capture cohesive subgraphs in a simple graph and has drawn much research attention over the last decade. Despite widespread research, none of the existing works support the efficient querying of historical k-cores in temporal graphs. In this paper, given an integer k and a time window, we study the problem of computing all k-cores in the graph snapshot over the time window. We propose an index-based solution and several pruning strategies to reduce the index size. We also design a novel algorithm to construct this index, whose running time is linear to the final index size. Lastly, we conducted extensive experiments on several real-world temporal graphs to show the high effectiveness of our index-based solution.
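For readers unfamiliar with the k-core model, here is a hedged sketch of the classic peeling algorithm on a single snapshot; the paper's contribution is an index that answers this for arbitrary historical time windows without re-running the peeling from scratch, which this sketch does not attempt.

```python
from collections import defaultdict

def k_core(edges, k):
    """Vertices of the k-core: repeatedly peel vertices of degree < k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                for u in adj.pop(v):
                    if u in adj:
                        adj[u].discard(v)   # removing v lowers u's degree
                changed = True
    return set(adj)

# A triangle plus a pendant vertex: the 2-core is the triangle.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(sorted(k_core(edges, 2)))  # [1, 2, 3]
```

A historical query would first restrict `edges` to those whose timestamps fall inside the window, which is exactly the repeated work the proposed index avoids.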

An Experimental Evaluation and Guideline for Path Finding in Weighted Dynamic Network [Download Paper] (The University of Queensland), (University of Queensland), (The Hong Kong University of Science and Technology) The shortest path computation is a building block of various network applications. Because real-life networks evolve over time, the Dynamic Shortest Path (DSP) problem has drawn much research attention in recent years. However, DSP involves many factors related to network topology, update pattern, and query characteristics, while existing works test their algorithms only in very limited situations and comparisons, so it is still hard to choose the most suitable method during implementation. To this end, we first identify the determinant dimensions and constraint dimensions of the DSP problem and create a complete problem space to cover all possible situations. Then we evaluate the state-of-the-art DSP methods under the same implementation standard and test them systematically on a set of synthetic dynamic networks. Furthermore, we propose the dynamic degree to classify dynamic environments and use throughput to evaluate their dynamic performance. These results serve as guidelines for finding the best solution for each situation during system implementation, and also identify research opportunities. Finally, we validate our findings on real-life dynamic networks.
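As a reference point for the methods compared above, here is a hedged sketch of Dijkstra's algorithm, the static baseline that DSP approaches either re-run from scratch or incrementally maintain as edge weights change; the incremental maintenance itself is what the surveyed methods differ on and is not shown here.

```python
import heapq

def dijkstra(graph, src):
    """Single-source shortest distances; graph maps node -> [(nbr, weight)]."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry, skip
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 2, 'c': 3}
```

The trade-off the paper's guideline captures is when re-running this baseline beats maintaining auxiliary structures under a given update rate and query load.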

Materializing Knowledge Bases via Trigger Graphs [Download Paper] (Samsung AI Research), (LIRMM, Inria, University of Montpellier, CNRS), (University of Bologna), (Vrije Universiteit Amsterdam) The chase is a well-established family of algorithms used to materialize Knowledge Bases (KBs) for tasks like query answering under dependencies or data cleaning. A general problem of chase algorithms is that they might perform redundant computations. To counter this problem, we introduce the notion of Trigger Graphs (TGs), which guide the execution of the rules avoiding redundant computations. We present the results of an extensive theoretical and empirical study that seeks to answer when and how TGs can be computed and what the benefits of TGs are when applied over real-world KBs. Our results include introducing algorithms that compute (minimal) TGs. We implemented our approach in a new engine, called GLog, and our experiments show that it can be significantly more efficient than the chase, enabling us to materialize Knowledge Graphs with 17B facts in less than 40 min using a single machine with commodity hardware.
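To make the redundancy concrete, here is a hedged sketch of naive materialization for a single transitivity rule, re-scanning all fact pairs every round; this is exactly the repeated work trigger graphs are designed to avoid (the function name is illustrative, not GLog's API, and the full chase also handles existential rules, which this omits).

```python
def chase_transitive(edges):
    """Materialize the transitive closure of a binary relation by
    naive fixpoint iteration: every round re-joins ALL fact pairs,
    including pairs that cannot possibly produce anything new."""
    facts = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(facts):
            for (c, d) in list(facts):
                if b == c and (a, d) not in facts:
                    facts.add((a, d))
                    changed = True
    return facts

print(sorted(chase_transitive({("a", "b"), ("b", "c")})))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

A trigger graph instead records which rule executions can feed which others, so each round only evaluates rules over facts that were actually derived since their last evaluation.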