VLDB 2021: Research Track Papers

This page lists the Research Sessions and the corresponding papers that will be presented in the conference. Note that all the following papers have been published in the Proceedings of VLDB. VLDB 2021 includes the papers published in PVLDB Vol. 13, No. 13 (roll-over) and the papers published in PVLDB Vol. 14, No. 1 - 12.

All times are given to the Copenhagen local timezone at the conference time (CEST).

17Aug

09:00 – 10:30 CESTResearch Session 1: Persistent Memory I Chaired by Peter Boncz

Approaching DRAM performance by using microsecond-latency flash memory for small-sized random read accesses: a new access method and its graph applications [Download Paper] Tomoya Suzuki (Kioxia Corporation), Kazuhiro Hiwada (Kioxia Corporation), Hirotsugu Kajihara (Kioxia Corporation), Shintaro Sano (Kioxia Corporation), Shuou Nomura (Kioxia Corporation), Tatsuo Shiozawa (Kioxia Corporation) For applications in which small-sized random accesses frequently occur for datasets that exceed DRAM capacity, placing the datasets on SSD can result in poor application performance. For the read-intensive case we focus on in this paper, low latency flash memory with microsecond read latency is a promising solution. However, when they are used in large numbers to achieve high IOPS (Input/Output operations Per Second), the CPU processing involved in IO requests is an overhead. To tackle the problem, we propose a new access method combining two approaches: 1) optimizing issuance and completion of the IO requests to reduce the CPU overhead. 2) utilizing many contexts with lightweight context switches by stackless coroutines. These reduce the CPU overhead per request to less than 10 nsec, enabling read access with DRAM-like overhead, while the access latency longer than DRAM can be hidden by the context switches. We apply the proposed method to graph algorithms such as BFS (Breadth First Search), which involves many small-sized random read accesses. In our evaluation, the large graph data is placed on microsecond-latency flash memories within prototype boards, and it is accessed by the proposed method. As a result, for the synthetic and real-world graphs, the execution times of the graph algorithms are 88-141% of those when all the data are placed in DRAM.

Optimizing In-memory Database Engine For AI-powered On-line Decision Augmentation Using Persistent Memory [Download Paper] Cheng Chen (4Paradigm Inc.), Jun Yang (4Paradigm Inc.), mian lu (4Paradigm Inc.), taize wang (4Paradigm Inc.), zhao zheng (4Paradigm Inc.), Yuqiang Chen (4th Paradigm), Wenyuan Dai (4Paradigm Inc.), Bingsheng He (National University of Singapore), Weng-Fai Wong (National University of Singapore), Guoan Wu (Intel), Yuping Zhao (Intel), Andy Rudoff (Intel) On-line decision augmentation (OLDA) has been considered as a promising paradigm for real-time decision making powered by Artificial Intelligence (AI). OLDA has been widely used in many applications such as real-time fraud detection, personalized recommendation, etc. On-line inference puts real-time features extracted from multiple time windows through a pre-trained model to evaluate new data to support decision making. Feature extraction is usually the most time-consuming operation in many OLDA data pipelines. In this work, we started by studying how existing in-memory databases can be leveraged to efficiently support such real-time feature extractions. However, we found that existing in-memory databases cost hundreds or even thousands of milliseconds. This is unacceptable for OLDA applications with strict realtime constraints. We therefore propose FEDB (Feature Engineering Database), a distributed in-memory database system designed to efficiently support on-line feature extraction. Our experimental results show that FEDB can be one to two orders of magnitude faster than the state-of-the-art in-memory databases on real-time feature extraction. Furthermore, we explore the use of the Intel Optane DC Persistent Memory Module (PMEM) to make FEDB more cost-effective. When comparing the proposed PMEM-optimized persistent skiplist to the FEDB using DRAM+SSD, PMEM-based FEDB can shorten the tail latency up to 19.7%, reduce the recovery time up to 99.7%, and save up to 58.4% total cost of a real OLDA pipeline.

Zen: a High-Throughput Log-Free OLTP Engine for Non-Volatile Main Memory [Download Paper] Gang Liu (Chinese Academy of Sciences), Leying Chen (Chinese Academy of Sciences), Shimin Chen (Chinese Academy of Sciences) Emerging Non-Volatile Memory (NVM) technologies like 3DXpoint promise significant performance potential for OLTP databases. However, transactional databases need to be redesigned because the key assumptions that non-volatile storage is orders of magnitude slower than DRAM and only supports blocked-oriented access have changed. NVMs are byte-addressable and almost as fast as DRAM. The capacity of NVM is much (4-16x) larger than DRAM. Such NVM characteristics make it possible to build OLTP database entirely in NVM main memory.
This paper studies the structure of OLTP engines with hybrid NVM and DRAM memory. We observe three challenges to design an OLTP engine for NVM: tuple metadata modifications, NVM write redundancy, and NVM space management. We propose Zen, a high-throughput log-free OLTP engine for NVM. Zen addresses the three design challenges with three novel techniques: metadata enhanced tuple cache, log-free persistent transactions, and light-weight NVM space management. Experimental results on a real machine equipped with Intel Optane DC Persistent Memory show that Zen achieves up to 10.1x improvement compared with existing solutions to run an OLTP database as large as the size of NVM while achieving fast failure recovery.

Revisiting the Design of LSM-tree Based OLTP Storage Engine with Persistent Memory [Download Paper] Baoyue Yan (Beihang University), Xuntao Cheng (AZFT), Bo Jiang (Beihang University), Shibin Chen (AZFT), Canfang Shang (AZFT), Jianying Wang (AZFT), kenry huang (alibaba), Xinjun Yang (AZFT), Wei Cao (AZFT), Feifei Li (Alibaba) The recent byte-addressable and large-capacity commercialized persistent memory (PM) is promising to drive database as a service (DBaaS) into unchartered territories. This paper investigates how to leverage PMs to revisit the conventional LSM-tree based OLTP storage engines designed for DRAM-SSD hierarchy for DBaaS instances. Specifically, we (1) propose a light-weight PM allocator named Halloc customized for LSM-tree, (2) build a high-performance Semi-persistent Memtable utilizing the persistent in-memory writes of PM, (3) design a concurrent commit algorithm named Reorder Ring to aschieve log-free transaction processing for OLTP workloads and (4) present a Global Index as the new globally sorted persistent level with non-blocking in-memory compaction. The design of Reorder Ring and Semi-persistent Memtable achieves fast writes without synchronized logging overheads and achieves near instant recovery time. Moreover, the design of Semi-persistent Memtable and Global Index with in-memory compaction enables the byte-addressable persistent levels in PM, which significantly reduces the read and write amplification as well as the background compaction overheads. The overall evaluation shows that the performance of our proposal over PM-SSD hierarchy outperforms the baseline by up to 3.8x in YCSB benchmark and by 2x in TPC-C benchmark.

SaS: SSD as SQL Database System [Download Paper] Jong-Hyeok Park (Sungkyunkwan University), Soyee Choi (SungKyunKwan University), Gihwan Oh (Sungkyunkwan University), Sang Won Lee (Sungkyunkwan University) Every database engine runs on top of an operating system in the host, strictly separated with the storage. This more-than-half-century-old IHDE (In-Host-Database-Engine) architecture, however, reveals its limitations when run on fast flash memory SSDs. In particular, the IO stacks incur significant run-time overhead and also hinder vertical optimizations between database engines and SSDs. In this paper, we envisage a new database architecture, called SaS (SSD as SQL database engine), where a full-blown SQL database engine runs inside SSD, tightly integrated with SSD architecture without intervening kernel stacks. As IO stacks are removed, SaS is free from their run-time overhead and further can explore numerous vertical optimizations between database engine and SSD. SaS evolves SSD from dummy block device to database server with SQL as its primary interface. The benefit of SaS will be more outstanding in the data centers where the distance between database engine and the storage is ever widening because of virtualization, storage disaggregation, and open software stacks. The advent of computational SSDs with more compute resource will enable SaS to be more viable and attractive database architecture.

09:00 – 10:30 CESTResearch Session 2: Data Mining Chaired by Martin Hentschel

Comprehensive and Efficient Workload Compression [Download Paper] Shaleen Deep (University of Wisconsin-Madison), Anja Gruenheid (Google Inc.), Paraschos Koutris (University of Wisconsin-Madison), Jeff Naughton (Google), Stratis Viglas (University of Edinburgh) This work studies the problem of constructing a representative workload from a given input workload where the former serves as an approximation with guarantees of the latter. We discuss our work in the context of workload analysis and monitoring. As an example, evolving system usage patterns in a database system can cause load imbalance and performance regressions which can be controlled by monitoring system usage patterns, i.e., a representative workload, over time. In order to construct such a workload in a principled manner, we formalize the notion of workload representativity and coverage. These metrics capture the intuition that the distribution of features in a compressed workload should match a target distribution, increasing representativity, and include common queries as well as outliers, increasing coverage. We show that solving this problem optimally is NP-Hard and present a novel greedy algorithm that provides approximation guarantees. We compare our techniques to established algorithms in this problem space such as sampling and clustering, and demonstrate advantages and key trade-offs of our techniques.

On the Efficiency of K-Means Clustering: Evaluation, Optimization, and Algorithm Selection [Download Paper] Sheng Wang (New York University), Yuan Sun (Monash University), Zhifeng Bao (RMIT University) This paper presents a thorough evaluation of the existing methods that accelerate Lloyd’s algorithm for fast k-means clustering. To do so, we analyze the pruning mechanisms of existing methods, and summarize their common pipeline into a unified evaluation framework UniK. Our UniK embraces a class of well-known methods and enables a fine-grained performance breakdown of existing methods. Within UniK, we thoroughly evaluate the pros and cons of existing methods using multiple performance metrics on a number of datasets. Furthermore, we derive an optimized algorithm over UniK, which effectively hybridizes multiple existing methods for more aggressive pruning. To take this further, we also investigate whether the most efficient method for a given clustering task can be automatically selected by machine learning, to benefit practitioners and researchers.

Comprehensible Counterfactual Explanation on Kolmogorov-Smirnov Test [Download Paper] Zicun Cong (Simon Fraser University), Lingyang Chu (McMaster University), Yu Yang (City University of Hong Kong), Jian Pei (Simon Fraser University) The Kolmogorov-Smirnov (KS) test is popularly used in many applications, such as anomaly detection, astronomy, database security and AI systems. One challenge remained untouched is how we can obtain an explanation on why a test set fails the KS test. In this paper, we tackle the problem of producing counterfactual explanations for test data failing the KS test. Concept-wise, we propose the notion of most comprehensible counterfactual explanations, which accommodates both the KS test data and the user domain knowledge in producing explanations. Computation-wise, we develop an efficient algorithm MOCHE (for MOst CompreHensible Explanation) that avoids enumerating and checking an exponential number of subsets of the test set failing the KS test. MOCHE not only guarantees to produce the most comprehensible counterfactual explanations, but also is orders of magnitudes faster than the baselines. Experiment-wise, we present a systematic empirical study on a series of benchmark real datasets to verify the effectiveness, efficiency and scalability of most comprehensible counterfactual explanations and MOCHE.

GeCo: Quality Counterfactual Explanations in Real Time [Download Paper] Maximilian Schleich (University of Washington), Zixuan Geng (University of Washington), Yihong Zhang (University of Washington), Dan Suciu (University of Washington) Explanations often take the form of counterfactuals, which consists of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging, because of the inherent tension between a rich semantics of the domain, and the need for real time response. In this paper we present GeCo, the first system that can compute plausible and feasible counterfactual explanations in real time. At its core, GeCo relies on a genetic algorithm, which is customized to favor searching counterfactual explanations with the smallest number of changes. To achieve real-time performance, we introduce two novel optimizations: $\Delta$-representation of candidate counterfactuals, and partial evaluation of the classifier. We compare empirically GeCo against five other systems described in the literature, and show that it is the only system that can achieve both high quality explanations and real time answers.

A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing [Download Paper] Peng Cheng (East China Normal University), Jiabao Jin (East China Normal University), Lei Chen (Hong Kong University of Science and Technology), Xuemin Lin (University of New South Wales), Libin Zheng (Sun Yat-sen University) With the rapid development of smart mobile devices, the car-hailing platforms (e.g., Uber or Lyft) have attracted much attention from both the academia and the industry. In this paper, we consider an important dynamic car-hailing problem, namely maximum revenue vehicle dispatching (MRVD), in which rider requests dynamically arrive and drivers need to serve as many riders as possible such that the entire revenue of the platform is maximized. We prove that the MRVD problem is NP-hard and intractable. In addition, the dynamic car-hailing platforms have no information of the future riders, which makes the problem even harder. To handle the MRVD problem, we propose a queueing-based vehicle dispatching framework, which first uses existing machine learning algorithms to predict the future vehicle demand of each region, then estimates the idle time periods of drivers through a queueing model for each region. With the information of the predicted vehicle demands and estimated idle time periods of drivers, we propose two batch-based vehicle dispatching algorithms to efficiently assign suitable drivers to riders such that the expected overall revenue of the platform is maximized during each batch processing. Through extensive experiments, we demonstrate the efficiency and effectiveness of our proposed approaches over both real and synthetic datasets. In summary, our methods can achieve 5~10% increase on overall revenue without sacrifice on running speed compared with existing solutions.

09:00 – 10:30 CESTResearch Session 3: Sketches Chaired by Thomas Neumann

On the algebra of data sketches [Download Paper] Jakub Lemiesz (Wrocław University of Science and Technology) We consider the problem of designing a distributed data sketch for scenario in which data stream is observed by many independent network nodes. We require that a sketch apart from being computationally and memory efficient should also be mergeable in a way that mimics set theory operations on related data sets. For example, when monitoring network traffic, one may consider how many distinct packets passed through a given node (sum of sets for different time windows) or passed through two given nodes (intersection of sets from two locations) and what is their total size (intersection of weighted sets).
In this paper we propose a sketch that allows to efficiently summarize sets constructed from a sequence of set theory operations. We also provide an analytical control over the trade-off between the accuracy and storage/computational requirements. In comparison to the previous works the proposed solution 1) allows the weights of elements, 2) allows performing set theory operations simultaneous on a large number of sketches, 3) does not require computationally expensive numerical calculations and guarantees low overheads.

SetSketch: Filling the Gap between MinHash and HyperLogLog [Download Paper] Otmar Ertl (Dynatrace Research) MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.

SKT: A One-Pass Multi-Sketch Data Analytics Accelerator [Download Paper] Monica Chiosa (ETH Zurich), Thomas Preußer (Accemic Technologies), Gustavo Alonso (ETHZ) Data analysts often need to characterize a data stream as a basic first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter characterizing the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the growing specialization of the underlying compute infrastructure by hardware accelerators, this paper proposes SKT, an FPGA-based bump-in-the-wire accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes streams at TCP/IP line rates of 100 Gbps and is built to fit emerging cloud service architectures, such as Microsoft’s Catapult or Amazon’s AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. It demonstrates by extensive experimentation how the FPGA-accelerated SKT implementation achieves a significant performance gain over high-end, server-class CPUs.

KLL±: Approximate Quantile Sketches over Dynamic Datasets [Download Paper] Fuheng Zhao (UCSB), Sujaya A Maiyya (University Of California, Santa Barbara), Ryan Weiner (UCSB), Divy Agrawal (University of California, Santa Barbara), Amr El Abbadi (UC Santa Barbara) Recently the long standing problem of optimal construction of quantile sketches was resolved by Karnin, Lang, and Liberty in what they propose as KLL sketch (FOCS 2016). The algorithm for KLL is only online updatable for insert operations, but not for delete operations. For tasks like data partitioning through quantile range, it is necessary to support delete operations. When the data set is updated dynamically, i.e., when data elements are inserted and deleted, the quantile sketch should reflect the changes. In this paper, we propose KLL± -- the first quantile approximation algorithm to account for both inserts and deletes in the data steam. KLL± extends the functionality of KLL sketch to support arbitrary updates with small space overhead. The space bound for KLL± is O(α^(1.5)/ε log2log(1/εδ)), where ε and δ are constants that determine precision and failure probability, and α bounds the number of deletions with respect to insert operations. The experimental evaluation of KLL± highlights that with minimal space overhead, KLL± achieves comparable accuracy in quantile approximation as compared to KLL.

On-Off Sketch: A Fast and Accurate Sketch on Persistence [Download Paper] Yinda Zhang (Peking University), Jinyang Li (Peking University), Yutian Lei (Xiangtan University), Tong Yang (Peking University), Zhetao Li (湘潭大学), Gong Zhang (Huawei), Bin Cui (Peking University) Approximate stream processing has attracted much attention recently. Prior art mostly focuses on characteristics like frequency, cardinality, and quantile. Persistence, as a new characteristic, is getting increasing attention. Unlike frequency, persistence highlights behaviors where an item appears recurrently in many time windows of a data stream. There are two typical problems with persistence – persistence estimation and finding persistent items. In this paper, we propose the On-Off sketch to address both problems. For persistence estimation, using the characteristic that the persistence of an item is increased periodically, we compress increments when multiple items are mapped to the same counter, which significantly reduces the error. Compared with the Count-Min sketch, 1) in theory, we prove that the error of the On-Off sketch is always smaller; 2) in experiments, the On-Off sketch achieves around 6.17 times smaller error and 2.2 times higher throughput. For finding persistent items, we propose a technique to separate persistent and non-persistent items, further improving the accuracy. We show that the space complexity of our On-Off sketch is much better than the state-of-the-art (PIE), and it reduces the error up to 4 orders of magnitude and achieves 2.84 times higher throughput than prior algorithms in experiments.

09:00 – 10:30 CESTResearch Session 4: Data Preparation Chaired by Damianos Chatziantoniou

RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation [Download Paper] Nan Tang (Qatar Computing Research Institute, HBKU), Ju Fan (Renmin University of China), Fangyi Li (Renmin University of China), Jianhong Tu (Renmin University of China), Xiaoyong Du (Renmin University of China), Guoliang Li (Tsinghua University), Samuel Madden (MIT), Mourad Ouzzani (Qatar Computing Research Institute, HBKU) Can AI help automate human-easy but computer-hard data preparation tasks that currently heavily involve data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models (“X ” could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional en- coder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data prepara- tion tasks, such as value normalization, data transformation, data annotation, etc. Beyond RPT, we also discuss several appealing techniques for data preparation, e.g., collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we also identify activities that will unleash a series of research opportunities to advance the field of data preparation.

LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization [Download Paper] Yiming Lin (University of California, Irvine), Daokun Jiang (University of California, Irvine), Roberto Yus (UC Irvine), Georgios Bouloukakis (Telecom SudParis), Andrew Chio (university of California Irvine), Sharad Mehrotra (U.C. Irvine), Nalini Venkatasubramanian (University of California, Irvine) This paper explores the data cleaning challenges that arise in using WiFi connectivity data to locate users to semantic indoor locations such as buildings, regions, rooms. WiFi connectivity data consists of sporadic connections between devices and nearby WiFi access points (APs), each of which may cover a relatively large area within a building. Our system, entitled semantic LOCATion cleanER (LOCATER), postulates semantic localization as a series of data cleaning tasks - first, it treats the problem of determining the AP to which a device is connected between any two of its connection events as a missing value detection and repair problem. It then associates the device with the semantic subregion (e.g., a conference room in the region) by postulating it as a location disambiguation problem. LOCATER uses a bootstrapping semi-supervised learning method for coarse localization and a probabilistic method to achieve finer localization. The paper shows that LOCATER can achieve significantly high accuracy at both the coarse and fine levels.

CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties [Download Paper] Yinjun Wu (University of Pennsylvania), James Weimer (University of Pennsylvania), Susan B Davidson (University of Pennsylvania) High-quality labels are expensive to obtain for many machine learning tasks, such as medical image classification tasks. Therefore, probabilistic (weak) labels produced by weak supervision tools are used to seed a process in which influential samples with weak labels are identified and cleaned by several human annotators to improve the model performance. To lower the overall cost and computational overhead of this process, we propose a solution called Chef(CHEap and Fast label cleaning), which consists of the following three components. First, to reduce the cost of human annotators, we use Infl, which prioritizes the most influential training samples for cleaning and provides cleaned labels to save the cost of one human annotator. Second, to accelerate the sample selector phase and the model constructor phase, we use Increm-Infl to incrementally produce influential samples, and DeltaGrad-L to incrementally update the model. Third, we redesign the typical label cleaning pipeline so that human annotators iteratively clean smaller batch of samples rather than one big batch of samples. This yields better over all model performance and enables possible early termination when the expected model performance has been achieved. Extensive experiments show that our approach gives good model prediction performance while achieving significant speed-ups.

Horizon: Scalable Dependency-driven Data Cleaning [Download Paper] El Kindi Rezig (MIT), Mourad Ouzzani (Qatar Computing Research Institute, HBKU), Walid G. Aref (Purdue University), Ahmed Elmagarmid (QCRI), Ahmed Mahmood (Purdue University), Michael Stonebraker (MIT) An often-cited figure reports that data scientists spend at least 80% of their time preparing and cleaning data. As a result, the need for effective and scalable data cleaning systems has never been more pressing. A large class of data repair algorithms rely on integrity constraints to detect and repair errors. A well-studied class of such constraints is Functional Dependencies (FDs, for short). Although there has been an increased interest in developing general data cleaning systems for a myriad of data errors, scalability has been left behind. This is because current systems assume data cleaning is done offline and in one iteration. However, developing data science pipelines is highly iterative and requires efficient cleaning techniques to scale to millions of records in seconds/minutes, not days. In our efforts to re-think the data cleaning stack and bring it to the era of data science, we introduce Horizon, an end-to-end FD repair system to address two key challenges: (1) Accuracy: most existing FD repair techniques aim to produce repairs that minimize changes to the data. However, this may lead to incorrect combinations of attribute values (or patterns). Horizon smartly leverages the interaction between data patterns induced by various FDs and subsequently selects repairs that preserve the most frequent patterns found in the original data, leading to a better repair accuracy. (2) Scalability: existing data cleaning systems struggle when dealing with large-scale real-world datasets. Horizon features a linear-time repair algorithm that can scale to millions of records and is orders of magnitude faster than state-of-the-art cleaning algorithms. We benchmark Horizon and state-of-the-art cleaning systems on multiple datasets and metrics and show that Horizon consistently outperforms existing techniques both in terms of repair quality and scalability.

Approximating Median Absolute Deviation with Bounded Error [Download Paper] Zhiwei Chen (Tsinghua University), Shaoxu Song (Tsinghua University), Ziheng Wei (Huawei Technologies Co., Ltd.), Jingyun Fang (Huawei Technologies Co., Ltd.), Jiang Long (Huawei Technologies Co., Ltd.) The median absolute deviation (MAD) is a statistic measuring the variability of a set of quantitative elements. It is known to be more robust to outliers than the standard deviation (SD), and thereby widely used in outlier detection. Computing the exact MAD however is costly, e.g., by calling an algorithm of finding median twice, with space cost O(n) over n elements in a set. In this paper, we propose the first fully mergeable approximate MAD algorithm, OP- MAD, with one-pass scan of the data. Remarkably, by calling the proposed algorithm at most twice, namely TP-MAD, it guarantees to return an (e, 1)-accurate MAD, i.e., the error relative to the exact MAD is bounded by the desired e or 1. The space complexity is reduced to O(m) while the time complexity is O(n+mlogm), where m is the size of the sketch used to compress data, related to the desired error bound e. To get a more accurate MAD, i.e., with smaller e, the sketch size m will be larger, a trade-off between effectiveness and efficiency. In practice, we often have the sketch size m ≪ n, leading to constant space cost O(1) and linear time cost O(n). The extensive experiments over various datasets demon- strate the superiority of our solution, e.g., 160000× less memory and 18× faster than the aforesaid exact method in datasets p areto and norm. Finally, we further implement and evaluate the parallelizable TP-MAD in Apache Spark, and the fully mergeable OP-MAD in Structured Streaming.

09:00 – 10:30 CESTResearch Session 5: Graph Analysis Chaired by Philippe Cudre-Mauroux

On Analyzing Graphs with Motif-Paths [Download Paper] Xiaodong Li (The University of Hong Kong), Reynold Cheng (The University of Hong Kong, China), Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign), Caihua Shan (The University of Hong Kong), Chenhao Ma (The University of Hong Kong), Hongtai Cao (University of Illinois at Urbana-Champaign) Path-based solutions have been shown to be useful for various graph analysis tasks, such as link prediction and graph clustering. However, they are no longer adequate for handling complex and gigantic graphs. Recently, motif-based analysis has attracted a lot of attention. A motif, or a small graph with a few nodes, is often considered as a fundamental unit of a graph. Motif-based analysis captures high-order structure between nodes, and performs better than traditional ``edge-based'' solutions. In this paper, we study motif-path, which is conceptually a concatenation of one or more motif instances. We examine how motif-paths can be used in three path-based mining tasks, namely link prediction, local graph clustering and node ranking. We further address the situation when two graph nodes are not connected through a motif-path, and develop a novel defragmentation method to enhance it. Experimental results on real graph datasets demonstrate the use of motif-paths and defragmentation techniques improves graph analysis effectiveness.

GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra [Download Paper] Maciej Besta (ETH Zurich), Zur Vonarburg-Shmaria (ETH Zurich), Yannick Schaffner (ETH Zurich), Leonardo Schwarz (ETH Zurich), Grzegorz Kwasniewski (ETH Zurich), Lukas Gianinazzi (ETH Zurich), Jakub Beranek (VSB), Kacper Janda (AGH-UST), Tobias Holenstein (ETH), Sebastian Leisinger (ETHZ), Peter Tatkowski (ETH), Esref Özdemir (ETH Zürich), Adrian Balla (ETH Zurich), Marcin Copik (ETH Zurich), Philipp Lindenberger (ETH Zurich), Marek Konieczny (AGH-UST), Onur Mutlu (ETH Zurich), Torsten Hoefler (ETH Zurich) We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported with a broad concurrency analysis for portability in performance insights, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by >2x), maximal clique listing (by >9x), k-clique listing (by up to 1.1x), and subgraph isomorphism (by 2.5x), also obtaining better theoretical performance bounds.

A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search [Download Paper] Mengzhao Wang (Hangzhou Dianzi University), Xiaoliang Xu (Hangzhou Dianzi University), Qiang Yue (Hangzhou Dianzi University), Yuxiang Wang (Hangzhou Dianzi University) Approximate nearest neighbor search (ANNS) constitutes an important operation in a multitude of applications, including recommendation systems, information retrieval, and pattern recognition. In the past decade, graph-based ANNS algorithms have been the leading paradigm in this domain, with dozens of graph-based ANNS algorithms proposed. Such algorithms aim to provide effective, efficient solutions for retrieving the nearest neighbors for a given query. Nevertheless, these efforts focus on developing and optimizing algorithms with different approaches, so there is a real need for a comprehensive survey about the approaches' relative performance, strengths, and pitfalls. Thus here we provide a thorough comparative analysis and experimental evaluation of 13 representative graph-based ANNS algorithms via a new taxonomy and fine-grained pipeline. We compared each algorithm in a uniform test environment on eight real-world datasets and 12 synthetic datasets with varying sizes and characteristics. Our study yields novel discoveries, offerings several useful principles to improve algorithms, thus designing an optimized method that outperforms the state-of-the-art algorithms. This effort also helped us pinpoint algorithms' working portions, along with rule-of-thumb recommendations about promising research directions and suitable algorithms for practitioners in different fields.

Accelerating Exact Constrained Shortest Paths on GPUs [Download Paper] Shengliang Lu (National University of Singapore), Bingsheng He (National University of Singapore), Yuchen Li (Singapore Management University), Hao Fu (Tianjin University) The recently emerging applications such as software-defined networks and autonomous vehicles require efficient and exact solutions for constrained shortest paths (CSP), which finds the shortest path in a graph while satisfying some user-defined constraints. Compared with the common shortest path problems without constraints, CSP queries have a significantly larger number of subproblems. The most widely used labeling algorithm becomes prohibitively slow and impractical. Other existing approaches tend to find approximate solutions and build costly indices on graphs for fast query processing, which are not suitable for emerging applications with the requirement of exact solutions. A natural question is whether and how we can efficiently find the exact solution for CSP.
In this paper, we propose Vine, a framework that parallelizes the labeling algorithm to efficiently find the exact CSP solution using GPUs. The major challenge addressed in Vine is how to deal with a large number of subproblems that are mostly unpromising but require a significant amount of memory and computational resources. Our solution is twofold. First, we develop a two-level pruning approach to eliminate the subproblems by making good use of the GPU's hierarchical memory. Second, we propose an adaptive parallelism control model based on the observations that the degree of parallelism (DOP) is the key to performance optimization with the given amount of computational resources. Extensive experiments show that Vine achieves 18x speedup on average over the widely adopted CPU-based solution running on 40 CPU threads. Vine also has over 5x speedup compared with a GPU approach that statically controls the DOP. Compared to the state-of-the-art approximate solution with preprocessed indices, Vine provides exact results with competitive or even better performance.

Fast Algorithm for Anchor Graph Hashing [Download Paper] Yasuhiro Fujiwara (NTT Communication Science Laboratories), Sekitoshi Kanai (NTT Software Innovation Center), Yasutoshi Ida (NTT Software Innovation Center), Atsutoshi Kumagai (NTT Software Innovation Center), Naonori Ueda (NTT Communication Science Labs.) Anchor graph hashing is used in many applications such as cancer detection, web page classification, and drug discovery. It computes the hash codes from the eigenvectors of the matrix representing the similarities between data points and anchor points; anchors refer to the points representing the data distribution. In performing an approximate nearest neighbor search, the hash codes of a query data point are determined by identifying its closest anchor points. Anchor graph hashing, however, incurs high computation cost since (1) the computation cost of obtaining the eigenvectors is quadratic to the number of anchor points, and (2) the similarities of the query data point to all the anchor points must be computed. Our proposal, Tridiagonal hashing, increases the efficiency of anchor graph hashing because of its two advances: (1) we apply a graph clustering algorithm to compute the eigenvectors from the tridiagonal matrix obtained from the similarities between data points and anchor points, and (2) we detect anchor points closest to the query data point by using a dimensionality reduction approach. Experiments show that our approach is several orders of magnitude faster than the previous approaches. Besides, it yields high search accuracy than the original anchor graph hashing approach.

11:00 – 12:00 CESTResearch Session 6: Memory Management Chaired by Ippokratis Pandis

Breaking Down Memory Walls: Adaptive Memory Management in LSM-based Storage Systems [Download Paper] Chen Luo (Snowflake Inc.), Michael Carey (UC Irvine) Log-Structured Merge-trees (LSM-trees) have been widely used in modern NoSQL systems. Due to their out-of-place update design, LSM-trees have introduced memory walls among the memory components of multiple LSM-trees and between the write memory and the buffer cache. Optimal memory allocation among these regions is non-trivial because it is highly workload-dependent. Existing LSM-tree implementations instead adopt static memory allocation schemes due to their simplicity and robustness, sacrificing performance. In this paper, we attempt to break down these memory walls in LSM-based storage systems. We first present a memory management architecture that enables adaptive memory management. We then present a partitioned memory component structure with new flush policies to better exploit the write memory to minimize the write cost. To break down the memory wall between the write memory and the buffer cache, we further introduce a memory tuner that tunes the memory allocation between these two regions. We have conducted extensive experiments in the context of Apache AsterixDB using the YCSB and TPC-C benchmarks and we present the results here.

Achieving High Throughput and Elasticity in a Larger-than-Memory Store [Download Paper] Chinmay Kulkarni (University of Utah), Badrish Chandramouli (Microsoft Research), Ryan Stutsman (University of Utah) Millions of sensors, mobile applications and machines now generate billions of events. Specialized many-core key-value stores (KVSs) can ingest and index these events at high rates (over 100 Mops/s on one machine) if events are generated on the same machine; however, to be practical and cost-effective they must ingest events over the network and scale across cloud resources elastically.
We present Shadowfax, a new distributed KVS based on FASTER, that transparently spans DRAM, SSDs, and cloud blob storage while serving 130 Mops/s/VM over commodity Azure VMs using conventional Linux TCP. Beyond high single-VM performance, Shadowfax uses a unique approach to distributed reconfiguration that avoids any server-side key ownership checks or cross-core coordination both during normal operation and migration. Hence, Shadowfax can shift load in 17 s to improve system throughput by 10 Mops/s with little disruption. Compared to the state-of-the-art, it has 8× better throughput (than Seastar+memcached) and avoids costly I/O to move cold data during migration. On 12 machines, Shadowfax retains its high throughput to perform 930 Mops/s, which, to the best of our knowledge, is the highest reported throughput for a distributed KVS used for large-scale data ingestion and indexing.

Towards Cost-Effective and Elastic Cloud Database Deployment via Memory Disaggregation [Download Paper] Yingqiang Zhang (Alibaba Group), Chaoyi Ruan (USTC), Cheng Li (USTC), Jimmy Yang (Alibaba Group), Wei Cao (Alibaba), Feifei Li (Alibaba Group), Bo Wang (Alibaba Group), Jing Fang (Alibaba Group), Yuhui Wang (Alibaba Group), Jingze Huo (USTC), Chao Bi (USTC) It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of memory disaggregation architecture, relying on high-speed RDMA network, offers opportunities to build cost-effective and elastic cloud-native databases. There exist proposals to let unmodified applications run transparently on disaggregated systems. However, running relational database kernel atop such proposals experiences notable performance degradation and time-consuming failure recovery, offsetting the benefits of disaggregation. To address these challenges, in this paper, we propose a novel database architecture called LegoBase, which explores the co-design of database kernel and memory disaggregation. It pushes the memory management back to the database layer for bypassing the Linux I/O stack and re-using or designing (remote) memory access optimizations with an understanding of data access patterns. LegoBase further splits the conventional ARIES fault tolerance protocol to independently handle the local and remote memory failures for fast recovery of compute instances. We implemented LegoBase atop MySQL. We compare LegoBase against MySQL running on a standalone machine and the state-of-the-art disaggregation proposal Infiniswap. Our evaluation shows that even with a large fraction of data placed on the remote memory, LegoBase’s system performance in terms of throughput (up to 9.41% drop) and P99 latency (up to 11.58% increase) is comparable to the monolithic MySQL setup, and significantly outperforms (1.99×-2.33×, respectively) the deployment of MySQL over Infiniswap. Meanwhile, LegoBase introduces an up to 3.87× and 5.48× speedup of the recovery and warm-up time, respectively, over the monolithic MySQL and MySQL over Infiniswap, when handling failures or planned re-configurations.

11:00 – 12:00 CESTResearch Session 7: User Interfaces Chaired by Eleni Tzirita Zacharatou

Towards Plug-and-Play Visual Graph Query Interfaces: Data-driven Canned Pattern Selection for Large Networks [Download Paper] Zifeng Yuan (NTU), Huey Eng CHUA (Nanyang Technological University), Sourav S Bhowmick (Nanyang Technological University), Zekun Ye (Fudan University), Wook-Shin Han (POSTECH), Byron Choi (Hong Kong Baptist University) Canned patterns (i.e., small subgraph patterns) in visual graph query interfaces (a.k.a GUI) facilitate efficient query formulation by enabling pattern-at-a-time construction mode. However, existing GUIs for querying large networks either do not expose any canned patterns or if they do then they are typically selected manually based on domain knowledge. Unfortunately, manual generation of canned patterns is not only labor intensive but may also lack diversity for supporting efficient visual formulation of a wide range of subgraph queries. In this paper, we present a novel generic and extensible framework called TATTOO that takes a data-driven approach to automatically selecting canned patterns for a GUI
from large networks. Specifically, it first decomposes the underlying network into truss-infested and truss-oblivious regions. Then candidate canned patterns capturing different real-world query topologies are generated from these regions. Canned patterns based on a user-specified plug are then selected for the GUI from these candidates by maximizing coverage and diversity, and by minimizing the cognitive load of the pattern set. Experimental studies with real-world datasets demonstrate the benefits of TATTOO. Importantly, this work takes a concrete step towards realizing plug-and-play visual graph query interfaces for large networks.

PATSQL: Efficient Synthesis of SQL Queries from Example Tables with Quick Inference of Projected Columns [Download Paper] Keita Takenouchi (NTT DATA), Takashi Ishio (Nara Institute of Science and Technology), Joji Okada (NTT DATA), Yuji Sakata (NTT DATA) SQL is one of the most popular tools for data analysis, and it is now used by an increasing number of users without having expertise in databases. Several studies have proposed programming-by-example approaches to help such non-experts to write correct SQL queries. While existing methods support a variety of SQL features such as aggregation and nested query, they suffer a significant increase in computational cost as the scale of example tables increases. In this paper, we propose an efficient algorithm utilizing properties known in relational algebra to synthesize SQL queries from input and output tables. Our key insight is that a projection operator in a program sketch can be lifted above other operators by applying transformation rules in relational algebra, while preserving the semantics of the program. This enables a quick inference of appropriate columns in the projection operator, which is an essential component in synthesis but causes combinatorial explosions in prior work. We also introduce a novel form of constraints and its top-down propagation mechanism for efficient sketch completion. We implemented this algorithm in our tool PATSQL and evaluated it on 226 queries from prior benchmarks and Kaggle’s tutorials. As a result, PATSQL solved 67% of the benchmarks and found 90% of the solutions within a second. Our tool is available at https://naist-se.github.io/patsql/.

COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics [Download Paper] Tarique Siddiqui (Microsoft Research), Surajit Chaudhuri (Microsoft), Vivek Narasayya (Microsoft) Data analysis often involves comparing subsets of data across many dimensions for finding unusual trends and patterns. While the comparison between subsets of data can be expressed using SQL, they tend to be complex to write, and suffer from poor performance over large and high-dimensional datasets. In this paper, we propose a new logical operator COMPARE for relational databases that concisely captures the enumeration and comparison between subsets of data and greatly simplifies the expressing of a large class of comparative queries. We extend the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a commercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that COMPARE results in a significant speedup over existing approaches, including physical plans generated by today’s database systems, user-defined functions (UDF), as well as middleware solutions that compare subsets outside the databases.

11:00 – 12:00 CESTResearch Session 8: Queries Chaired by Sven Groppe

Rumble: Data Independence for Large Messy Data Sets [Download Paper] Ingo Müller (ETH Zürich), Ghislain Fourny (ETH Zurich), Stefan Irimescu (Beekeeper AG), Can Cikis (ETH Zurich), Gustavo Alonso (ETHZ) This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark. While data sets of this type are more and more wide-spread, most existing tools are built around a tabular data model, creating an impedance mismatch for both the engine and the query interface. In contrast, Rumble uses JSONiq, a standardized language specifically designed for querying JSON documents. The key challenge in the design and implementation of Rumble is mapping the recursive structure of JSON documents and JSONiq queries onto Spark's execution primitives based on tabular data frames. Our solution is to translate a JSONiq expression into a tree of iterators that dynamically switch between local and distributed execution modes depending on the nesting level. By overcoming the impedance mismatch in the engine, Rumble frees the user from solving the same problem for every single query, thus increasing their productivity considerably. As we show in extensive experiments, Rumble is able to scale to large and complex data sets in the terabyte range with a similar or better performance than other engines. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.

Scalable Querying of Nested Data [Download Paper] Jaclyn Smith (Oxford University), Michael Benedikt (Oxford University), Milos Nikolic (University of Edinburgh), Amir Shaikhha (University of Edinburgh) While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.
In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.

Procedural Extensions of SQL: Understanding their usage in the wild [Download Paper] Surabhi Gupta (Microsoft Research India), Karthik Ramachandra (Microsoft Research India) Procedural extensions of SQL have been in existence for many decades now. However, little is known about their magnitude of usage and their complexity in real-world workloads. Procedural code executing in a RDBMS is known to have inefficiencies and limitations; as a result there have been several efforts to address this problem. However, the lack of understanding of their use in real workloads makes it challenging to (a) motivate new work in this area, (b) identify research challenges and opportunities, and (c) demonstrate impact of novel work. We aim to address these challenges with our work.
In this paper, we present the results of our in-depth analysis of thousands of stored procedures, user-defined functions and triggers taken from several real workloads. We introduce SQL-ProcBench, a benchmark for procedural workloads in RDBMSs. SQL-ProcBench has been created using the insights derived from our analysis, and thus represents real workloads. Using SQL-ProcBench, we present an experimental evaluation on several database engines to understand and identify research challenges and opportunities. We emphasize the need to work on these interesting and relevant problems, and encourage researchers to contribute to this area.

11:00 – 12:00 CESTResearch Session 9: Metadata Chaired by Wim Martens

The Smallest Extraction Problem [Download Paper] Valerio Cetorelli (Roma Tre University), Paolo Atzeni (Univ.of Roma 3), Valter Crescenzi (Roma Tre University), Franco Milicchio (Roma Tre University) We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large and templated websites and therefore at effectively tackling Web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in literature. We then formalize the Smallest Extraction Problem(SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extract their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. The experiments on consolidated benchmarks show that the approach can substantially contribute to improve the state-of-the-art.

Glean: Structured Extractions from Templatic Documents [Download Paper] Sandeep Tata (Google, USA), Navneet Potti (Google), James B Wendt (Google), Lauro Beltrão Costa (Google), Marc Najork (Google), Beliz Gunel (Stanford University) Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones.
We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges : 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

Preference Queries over Taxonomic Domains [Download Paper] Paolo Ciaccia (Università di Bologna), Davide Martinenghi (Politecnico di Milano), Riccardo Torlone (Universita Roma Tre) When composing multiple preferences characterizing the most suitable results for a user, several issues may arise. Indeed, preferences can be partially contradictory, suffer from a mismatch with the level of detail of the actual data, and even lack natural properties such as transitivity. In this paper we formally investigate the problem of retrieving the best results complying with multiple preferences expressed in a logic-based language. Data are stored in relational tables with taxonomic domains, which allow the specification of preferences also over values that are more generic than those in the database. In this framework, we introduce two operators that rewrite preferences for enforcing the important properties of transitivity, which guarantees soundness of the result, and specificity, which solves all conflicts among preferences. Although, as we show, these two properties cannot be fully achieved together, we use our operators to identify the only two alternatives that ensure transitivity and minimize the residual conflicts. Building on this finding, we devise a technique, based on an original heuristics, for selecting the best results according to the two possible alternatives. We finally show, with a number of experiments over both synthetic and real-world datasets, the effectiveness and practical feasibility of the overall approach.

11:00 – 12:00 CESTResearch Session 10: Graph Traversal Chaired by Bogdan Arsintescu

ThunderRW: An In-Memory Graph Random Walk Engine [Download Paper] Shixuan Sun (National University of Singapore), Yuhang Chen (National University of Singapore), Shengliang Lu (National University of Singapore), Bingsheng He (National University of Singapore), Yuchen Li (Singapore Management University) As random walk (RW) is a powerful tool in many graph processing, mining and learning applications, this paper proposes an efficient in-memory random walk engine named ThunderRW. Compared with existing parallel systems on improving the performance of a single graph operation, ThunderRW supports massive parallel random walks. The core design of ThunderRW is motivated by our profiling results: RW algorithms have as high as 73.1% CPU pipeline cycles stalled due to irregular memory access, which suffers significantly more memory stalls than the conventional graph workloads of a single graph operation such as BFS and SSSP. To improve the memory efficiency, we first design a generic step-centric programming model named Gather-Move-Update to abstract different RW algorithms. Based on the programming model, we develop the step interleaving technique to hide memory access latency by switching the execution of different random walk queries. In our experiments, we use four representative RW-algorithms including PPR, DeepWalk, Node2Vec and MetaPath to demonstrate efficiency and programming flexibility of ThunderRW. Experiment results show that ThunderRW outperforms state-of-the-art approaches by an order of magnitude, and the step interleaving technique significantly reduces the CPU pipeline stall from 73.1% to 15.0%.

Towards an Efficient Weighted Random Walk Domination [Download Paper] Songsong Mo (Wuhan University), Zhifeng Bao (RMIT University), Ping Zhang (huawei), Zhiyong Peng ( Wuhan University, China) In this paper, we propose and study a new problem called the weighted random walk domination. Given a weighted graph $G(V, E)$ and a budget $B$ of the weighted random walk, it aims to find a $k$-size set $S$, which can minimize the total costs of the remaining nodes to access $S$ through the weighted random walk, which is bounded by $B$. This problem is critical to a range of real-world applications, such as advertising in social networks and telecommunication base station selection in wireless sensor networks. We first present a dynamic programming based greedy method (DpSel) as a baseline. DpSel is time-consuming when $|V|$ is huge. Thus, to overcome this drawback, we propose a matrix-based greedy method (MatrixSel), which can reduce the computation cost greatly. To further accelerate MatrixSel, we propose a BoundSel approach to reduce the number of the gain computations in each candidate selection by proactively estimating the upper bound of the marginal gain of the candidate node. Notably, all methods can achieve an approximation ratio of $(1 - 1/e)$. Experiments on real datasets have been conducted to verify the efficiency, effectiveness, memory consumption and scalability of our methods.

Maximizing Social Welfare in a Competitive Diffusion Model [Download Paper] Prithu Banerjee (UBC), Laks V.S. Lakshmanan (The University of British Columbia), Wei Chen (Microsoft) Influence maximization (IM) has garnered a lot of attention in the literature owing to applications such as viral marketing and infection containment. It aims to select a small number of seed users to adopt an item such that adoption propagates to a large number of users in the network. Competitive IM focuses on the propagation of competing items in the network. Existing works on competitive IM have several limitations. (1) They fail to incorporate economic incentives in users' decision making in item adoptions. (2) Majority of the works aim to maximize the adoption of one particular item, and ignore the collective role that different items play. (3) They focus mostly on one aspect of competition -- pure competition. To address these concerns we study competitive IM under a utility-driven propagation model called UIC, and study social welfare maximization. The problem in general is not only NP-hard but also NP-hard to approximate within any constant factor. We, therefore, devise instant dependent efficient approximation algorithms for the general case as well as a $(1-1/e-\epsilon)$-approximation algorithm for a restricted setting. Our algorithms outperform different baselines on competitive IM, both in terms of solution quality and running time on large real networks under both synthetic and real utility configurations.

11:00 – 12:00 CESTResearch Session 11: Missing Values Chaired by Mourad Khayati

Missing Value Imputation on Multidimensional Time Series [Download Paper] Parikshit Bansal (IIT Bombay), Prathamesh Deshpande (IIT Bombay), Sunita Sarawagi (Indian Institute of Technology) We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, whereas reliable data analytics calls for careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that often these provide worse results on aggregate analytics compared to just excluding the missing data.
DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and signals from correlated series at the same time. Instead of resorting to linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust parameter training with labeled data created using synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings.
Experiments across ten real datasets, five different missing scenarios, comparing seven conventional and three deep learning methods show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases, compared to the best existing method. Although slower than simpler matrix factorization methods, we justify the increased time overheads by showing that DeepMVI provides significantly more accurate imputation that finally impacts quality of downstream analytics.

Adaptive Data Augmentation for Supervised Learning over Missing Data [Download Paper] Tongyu Liu (Renmin University of China), Ju Fan (Renmin University of China), Yinqing Luo (Renmin University of China), Nan Tang (Qatar Computing Research Institute, HBKU), Guoliang Li (Tsinghua University), Xiaoyong Du (Renmin University of China) Real-world data is dirty, which causes serious problems in (supervised) machine learning (ML). The widely used practice in such scenario is to first repair the labeled source (a.k.a. train) data using rule-, statistical- or ML-based methods and then use the “repaired” source to train an ML model. During production, unlabeled target (a.k.a. test) data will also be repaired, and is then fed in the trained ML model for prediction. However, this process often causes a performance degradation when the source and target datasets are dirty with different noise patterns, which is common in practice.
In this paper, we propose an adaptive data augmentation approach, for handling missing data in supervised ML. The approach extracts noise patterns from target data, and adapts the source data with the extracted target noise patterns while still preserving supervision signals in the source. Then, it patches the ML model by retraining it on the adapted data, in order to better serve the target. To effectively support adaptive data augmentation, we propose a novel generative adversarial network (GAN) based framework, called DAGAN, which works in an unsupervised fashion. DAGAN consists of two connected GAN networks. The first GAN learns the noise pattern from the target, for target mask generation. The second GAN uses the learned target mask to augment the source data, for source data adaptation. The augmented source data is used to retrain the ML model. Extensive experiments show that our method significantly improves the ML model performance and is more robust than the state-of-the-art missing data imputation solutions for handling datasets with different missing value patterns.

Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions [Download Paper] Bojan Karlaš (ETH Zürich), Peng Li (GATECH), Renzhi Wu (Georgia Institute of Technology), Nezihe Merve Gürel (ETH Zürich), Xu Chu (GATECH), Wentao Wu (Microsoft Research), Ce Zhang (ETH) Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of ""Certain Predictions"" (CP) --- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction.
We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed --- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds.
We demonstrate one example use case of CP in the important application of ""data cleaning for machine learning"" (DC for ML). We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% of the gap on average.

13:00 – 15:00 CESTResearch Session 12: Cardinality Estimation Chaired by Zoi Kaoudi

Fauce: Fast and Accurate Deep Ensembles with Uncertainty for Cardinality Estimation [Download Paper] Jie Liu (University of California, Merced), Wenqian Dong (University of California, Merced), Dong Li (University of California, Merced), Qingqing Zhou (Tencent Inc.) Cardinality estimation is a fundamental and critical problem in databases. Recently, many estimators based on deep learning have been proposed to solve this problem and they have achieved promising results. However, these estimators struggle to provide accurate results for complex queries, due to not capturing real inter-column and inter-table correlations. Furthermore, none of these estimators contain the uncertainty information about their estimations. In this paper, we present a join cardinality estimator called Fauce. Fauce learns the correlations across all columns and all tables in the database, it also contains the uncertainty information of each estimation. Among all studied learned estimators, our results are promising: (1) Fauce has the smallest model size; (2) It has the fastest inference speed; (3) Compared with the state of the art estimator, Fauce has 10 times faster inference speed, and provides 1.3 to 6.7 times smaller estimation errors for complex queries; (4) To the best of our knowledge, Fauce is the first estimator that incorporates cardinality estimation uncertainty information into a deep learning model.

FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation [Download Paper] Rong Zhu (Alibaba Group), Ziniu Wu (Alibaba Group), Yuxing Han (Alibaba Group), Kai Zeng (Alibaba Group), Andreas Pfadler (Alibaba Group), Zhengping Qian (Alibaba Group), Jingren Zhou (Alibaba Group), Bin Cui (Peking University) Query optimizers rely on accurate cardinality estimation (CardEst) to produce good execution plans. The core problem of CardEst is how to model the rich joint distribution of attributes in an accurate and compact manner. Despite decades of research, existing methods either over-simplify the models only using independent factorization which leads to inaccurate estimates, or over-complicate them by lossless conditional factorization without any independent assumption which results in slow probability computation. In this paper, we propose FLAT, a CardEst method that is simultaneously fast in probability computation, lightweight in model size and accurate in estimation quality. The key idea of FLAT is a novel unsupervised graphical model, called FSPN. It utilizes both independent and conditional factorization to adaptively model different levels of attributes correlations, and thus combines their advantages. FLAT supports efficient online probability computation in near linear time on the underlying FSPN model, provides effective offline model construction and enables incremental model updates. It can estimate cardinality for both single table queries and multi-table join queries. Extensive experimental study demonstrates the superiority of FLAT over existing CardEst methods: FLAT achieves 1–5 orders of magnitude better accuracy, 1–3 orders of magnitude faster probability computation speed and 1–2 orders of magnitude lower storage cost. We also integrate FLAT into Postgres to perform an end-to-end test. It improves the query execution time by 12.9% on the well-known IMDB benchmark workload, which is very close to the optimal result 14.2% using the true cardinality.

Are We Ready For Learned Cardinality Estimation? [Download Paper] Xiaoying Wang (Simon Fraser University), Changbo Qu (Simon Fraser University), Weiyuan Wu (Simon Fraser University), Jiannan Wang (Simon Fraser University), Qingqing Zhou (Tencent Inc.) Cardinality estimation is a fundamental but long unresolved problem in query optimization. Recently, multiple papers from different research groups consistently report that learned models have the potential to replace existing cardinality estimators. In this paper, we ask a forward-thinking question: Are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. Firstly, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Secondly, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they cannot catch up with fast data up-dates and return large errors for different reasons. For less frequent updates, they can perform better but there is no clear winner among themselves. Thirdly, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by the changes in correlation, skewness, or domain size. More importantly, their behaviors are much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (control the cost of learned models and make learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems

Astrid: Accurate Selectivity Estimation for String Predicates using Deep Learning [Download Paper] Suraj Shetiya (The University of Texas at Arlington), Saravanan Thirumuruganathan (QCRI), Nick Koudas (University of Toronto), Gautam Das (U. of Texas Arlington) Accurate selectivity estimation for string predicates is a longstanding research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing such as embeddings to encode the strings and use it to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement.
We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings `ab', `abc' and `abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for `ab' is closer to `abc' than `abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring and suffix queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.

Flow-Loss: Learning Cardinality Estimates That Matter [Download Paper] Parimarjan Negi (MIT CSAIL), Ryan C Marcus (MIT), Andreas Kipf (MIT), Hongzi Mao (MIT CSAIL), Nesime Tatbul (Intel Labs and MIT), Tim Kraska (MIT), Mohammad Alizadeh (Massachusetts Institute of Technology) Recently there has been significant interest in using machine learn- ing to improve the accuracy of cardinality estimation. This work has focused on improving average estimation error, but not all estimates matter equally for downstream tasks like query optimization. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, for learning cardinality estimation models. Flow-Loss approximates the optimizer’s cost model and search algorithm with analytical functions, which it uses to optimize explicitly for better query plans. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain “plan graph”, in which different paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark (CEB) which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the plan costs and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models perform well. However, the Q-Error-trained model degrades significantly when evaluated on slightly different queries (e.g., similar but unseen query templates), while the Flow-Loss trained model generalizes better to such situations, achieving 4-8x better 99th percentile runtimes on unseen templates with the same model architecture and training data

NeuroCard: One Cardinality Estimator for All Tables [Download Paper] Zongheng Yang (UC Berkeley), Amog Kamsetty (UC Berkeley), Sifei Luan (UC Berkeley), Eric Liang (UC Berkeley), Yan Duan (COVARIANT.AI), Peter Chen (COVARIANT.AI), Ion Stoica (UC Berkeley) Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries, due to making lossy modeling assumptions and not capturing inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders of magnitude higher accuracy than the best prior methods (a new state-of-the-art result of 8.5× maximum error on JOB-light), scales to dozens of tables, while being compact in space (several MBs) and efficient to construct or update (seconds to minutes).

13:00 – 15:00 CESTResearch Session 13: Spatial Paths and Maps Chaired by Constantinos Costa

Towards Crowd-aware Indoor Path Planning [Download Paper] Tiantian Liu (Aalborg University), Huan Li (Aalborg University), Hua Lu (Roskilde University), Muhammad Aamir Cheema (Monash University), Lidan Shou (Zhejiang University) Indoor venues accommodate many people who collectively form crowds. Such crowds in turn influence people’s routing choices, e.g., people may prefer to avoid crowded rooms when walking from A to B. This paper studies two types of crowd-aware indoor path planning queries. The Indoor Crowd-Aware Fastest Path Query (FPQ) finds a path with the shortest travel time in the presence of crowds, whereas the Indoor Least Crowded Path Query (LCPQ) finds a path encountering the least objects en route. To process the queries, we design a unified framework with three major components. First, an indoor crowd model organizes indoor topology and captures object flows between rooms. Second, a time-evolving population estimator derives room populations for a future timestamp to support crowd-aware routing cost computations in query processing. Third, two exact and two approximate query processing algorithms process each type of query. All algorithms are based on graph traversal over the indoor crowd model and use the same search framework with different strategies of updating the populations during the search process. All proposals are evaluated experimentally on synthetic and real data. The experimental results demonstrate the efficiency and scalability of our framework and query processing algorithms.

AutoGR: Automated Geo-Replication with Fast System Performance and Preserved Application Semantics [Download Paper] Jiawei Wang (USTC), Cheng Li (USTC), Kai Ma (University of Science and Technology of China), Jingze Huo (USTC), Feng Yan (University of Nevada, Reno), Xinyu Feng (Nanjing University), Yinlong Xu (University of Science and Technology of China) Geo-replication is essential for providing low latency response and quality Internet services. However, designing fast and correct geo-replicated services is challenging due to the complex trade-off between performance and consistency semantics in optimizing the expensive cross-site coordination. State-of-the-art solutions rely on programmers to derive sufficient application-specific invariants and code specifications, which is both time-consuming and error-prone. In this paper, we propose an end-to-end geo-replication deployment framework AutoGR (AUTOmated Geo-Replication) to free programmers from such label-intensive tasks. AutoGR enables the geo-replication features for non-distributed, serializable applications in an automated way with optimized performance and correct application semantics. Driven by a novel static analysis tool Rigi, AutoGR can extract invariants for applications by verifying whether their geo-replicated versions obey the serializable semantics of the non-distributed application. Rigi takes application codes as inputs and infers a set of possible side effects and path conditions possibly leading to consistency violations. Rigi employs the Z3 theorem prover to identify pairs of conflicting side effects and feed them to a geo-replication framework for automated across-site deployment. We evaluate AutoGR by transforming four DB-compliant applications that are originally non-replicated to geo-replicated ones across 3 sites. Compared with the state-of-the-art human-intervention-free automated approaches (e.g., for strong consistency), AutoGR reduces up to 61.8% latency and achieves up to 2.12X higher peak throughput; compared with state-of-the-art approaches relying on a manual analysis (e.g., PoR), AutoGR can quickly enable the geo-replication feature with zero human intervention while offering similarly low latency and high peak throughput.

PPQ-Trajectory: Spatio-temporal Quantization for Querying in Large Trajectory Repositories [Download Paper] Shuang Wang (University of Warwick), Hakan Ferhatosmanoglu (University of Warwick) We present PPQ-trajectory, a spatio-temporal quantization based solution for querying large dynamic trajectory data. PPQ-trajectory includes a partition-wise predictive quantizer (PPQ) that generates an error-bounded codebook with autocorrelation and spatial proximity-based partitions. The codebook is indexed to run approximate and exact spatio-temporal queries over compressed trajectories. PPQ-trajectory includes a coordinate quadtree coding for the codebook with support for exact queries. An incremental temporal partition-based index is utilised to avoid full reconstruction of trajectories during queries. An extensive set of experimental results for spatio-temporal queries on real trajectory datasets is presented. PPQ-trajectory shows significant improvements over the alternatives with respect to several performance measures, including the accuracy of results when the summary is used directly to provide approximate query results, the spatial deviation with which spatio-temporal path queries can be answered when the summary is used as an index, and the time taken to construct the summary. Superior results on the quality of the summary and the compression ratio are also demonstrated.

MDTP: A Multi-source Deep Traffic Prediction Framework over Spatio-Temporal Trajectory Data [Download Paper] Ziquan Fang (Zhejiang University), Lu Pan (Zhejiang University), Lu Chen (Zhejiang University), Yuntao Du (Zhejiang University), Yunjun Gao (Zhejiang University) Traffic prediction has drawn increasing attention for its ubiquitous real-life applications in traffic management, urban computing, public safety, and so on. Recently, the availability of massive trajectory data and the success of deep learning motivate a plethora of deep traffic prediction studies. However, the existing neural-network-based approaches tend to ignore the correlations between multiple types of moving objects located in the same spatio-temporal traffic area, which is suboptimal for traffic prediction analytics.
In this paper, we propose a multi-source deep traffic prediction framework over spatio-temporal trajectory data, termed as MDTP. The framework includes two phases: spatio-temporal feature modeling and multi-source bridging. We present an enhanced graph convolutional network (GCN) model combined with long short-term memory network (LSTM) to capture the spatial dependencies and temporal dynamics of traffic in the feature modeling phase. In the multi-source bridging phase, we propose two methods, Sum and Concat, to connect the learned features from different trajectory data sources. Extensive experiments on two real-life datasets show that MDTP i) has superior efficiency, compared with classical time-series methods, machine learning methods, and state-of-the-art neural network-based approaches; ii) offers a significant performance improvement over the single-source traffic prediction approach; and iii) performs traffic predictions in seconds even on tens of millions of trajectory data. Also, we develop MDTP+, a user-friendly interactive system to demonstrate traffic prediction analysis.

The Simpler The Better: An Indexing Approach for Shared-Route Planning Queries. [Download Paper] Yuxiang Zeng (Hong Kong University of Science and Technology), Yongxin Tong (Beihang University), Yuguang Song (Beihang University), Lei Chen (Hong Kong University of Science and Technology) Ridesharing services have gained global popularity as a con-venient, economic, and sustainable transportation mode in recent years. One fundamental challenge in these services is planning the shared-routes (i.e., sequences of origins and destinations) among the passengers for the vehicles, such that the platform’s total revenue is maximized. Though many methods can solve this problem, their eﬀectiveness is still far from optimal on either empirical study (e.g., over 31% lower total revenue than our approach) or theoretical study (e.g., arbitrarily bad or impractical theoretical guar-antee). In this paper, we study the shared-route planning queries in ridesharing services and focus on designing eﬃ-cient algorithms with good approximation guarantees. Par-ticularly, our idea is to iteratively search the most prof-itable route among the unassigned requests for each vehicle, which is simpler than the existing methods. Unexpectedly, we prove this simple method has an approximation ratio of 0.5 to the optimal result. Moreover, we also design an index called additive tree to improve the eﬃciency and ap-ply randomization to improve the approximation guarantee. Finally, experimental results on two real datasets demon-strate that our additive-tree-based approach outperforms the state-of-the-art algorithms by obtaining up to 31.4%-127.4% higher total revenue.

QARTA: An ML-based System for Accurate Map Services [Download Paper] Mashaal Musleh (University of Minnesota), Sofiane Abbar (Qatar Computing Research Institute), Rade Stanojevic (Qatar Computing Research Institute), Mohamed Mokbel (University of Minnesota - Twin Cities) Maps services (e.g., routing and store finding) are ubiquitous in widely used applications including navigation systems, ride sharing, and items/food delivery. There are plenty of efforts dedicated to supporting such services through designing more efficient algorithms, e.g., efficient shortest path and range queries. We believe that efficiency is no longer a bottleneck to these map services. Instead, it is the accuracy of the underlying road network and query result. This paper presents QARTA; an open-source full-fledged system for highly accurate and scalable map services. QARTA employs machine learning techniques to construct its own highly accurate map, not only in terms of map topology but more importantly, in terms of edge weights. QARTA also employs machine learning techniques to calibrate its query answers based on contextual information, including transportation modality, location, and time of day/week. QARTA is currently deployed in all Taxis in the State of Qatar and in the third-largest food delivery company in the country, replacing the commercial map service that was in use, and receiving hundreds of thousands of daily API calls with a real-time response time. Experimental evaluation of QARTA in such a real deployment environment shows that QARTA has comparable or higher accuracy than commercial services.

Fast Augmentation Algorithms for Network Kernel Density Visualization [Download Paper] Tsz Nam Chan (Hong Kong Baptist University), Zhe Li (The Hong Kong Polytechnic University), Leong Hou U (University of Macau), Jianliang Xu (Hong Kong Baptist University), Reynold Cheng (The University of Hong Kong, China) Network kernel density visualization, or NKDV, has been extensively used to visualize spatial data points in various domains, including traffic accident hotspot detection, crime hotspot detection, disease outbreak detection, and business and urban planning. Due to a wide range of applications for NKDV, some geographical software, e.g., ArcGIS, can also support this operation. However, computing NKDV is very time-consuming. Although NKDV has been used for more than a decade in different domains, existing algorithms are not scalable to million-sized datasets. To address this issue, we propose three efficient methods in this paper, namely aggregate distance augmentation (ADA), interval augmentation (IA), and hybrid augmentation (HA), which can significantly reduce the time complexity for computing NKDV. In our experiments, ADA, IA and HA can achieve at least 5x to 10x speedup, compared with the state-of-the-art solutions.

13:00 – 15:00 CESTResearch Session 14: Graph Matching Chaired by Panagiotis Karras

Parallel Graph Algorithms in Constant Adaptive Rounds: Theory meets Practice [Download Paper] Soheil Behnezhad (University of Maryland), Laxman Dhulipala (MIT CSAIL), Hossein Esfandiari (Google Research), Jakub Łącki (Google), Vahab Mirrokni (Google), Warren Schudy (Google) We study fundamental graph problems such as graph con-nectivity, minimum spanning forest (MSF), and approximate maximum (weight) matching in a distributed setting. In particular, we focus on the Adaptive Massively Parallel Com-putation (AMPC) model, which is a theoretical model that captures MapReduce-like computation augmented with a distributed hash table. We show the first AMPC algorithms for all of the studied problems that run in a constant number of rounds and use only O(n) space per machine, where 0 < < 1. Our results improve both upon the previous results in the AMPC model, as well as the best-known results in the MPC model, which is the theoretical model underpinning many popular distributed computation frameworks, such as MapReduce, Hadoop, Beam, Pregel and Giraph. Finally, we provide an empirical comparison of the algo-rithms in the MPC and AMPC models in a fault-tolerant distributed computation environment. We empirically evalu-ate our algorithms on a set of large real-world graphs and show that our AMPC algorithms can achieve improvements in both running time and round-complexity over optimized MPC baselines.

Optimizing Bipartite Matching in Real-World Applications by Incremental Cost Computation [Download Paper] Tenindra Abeywickrama (GrabTaxi Holdings Pte Ltd), Victor Liang (GrabTaxi Holdings Pte Ltd), Kian-Lee Tan (National University of Singapore) The Kuhn-Munkres (KM) algorithm is a classical combinatorial optimization algorithm that is widely used for minimum cost bipartite matching in many real-world applications, such as transportation. For example, a ride-hailing service may use it to find the optimal assignment of drivers to passengers to minimize the overall wait time. Typically, given two bipartite sets, this process involves computing the edge costs between all bipartite pairs and finding an optimal matching. However, existing works overlook the impact of edge cost computation on the overall running time. In reality, edge computation often significantly outweighs the computation of the optimal assignment itself, as in the case of assigning drivers to passengers which involves computation of expensive graph shortest paths. Following on from this observation, we observe common real-world settings exhibit a useful property that allows us to incrementally compute edge costs only as required using an inexpensive lower-bound heuristic. This technique significantly reduces the overall cost of assignment compared to the original KM algorithm, as we demonstrate experimentally on multiple real-world data sets, workloads, and problems. Moreover, our algorithm is not limited to this domain and is potentially applicable in other settings where lower-bounding heuristics are available.

RECEIPT: REfine CoarsE-grained IndePendent Tasks for Parallel Tip decomposition of Bipartite Graphs [Download Paper] Kartik Lakhotia (USC), Rajgopal Kannan (USC), Viktor K Prasanna (Unversity of Southern California), César De Rose (PUCRS) Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies (2, 2−bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update (peeling) process: deleting vertices with the minimum number of butterflies and correspondingly updating the butterfly count of their 2-hop neighbors. The need to explore 2-hop neighborhood renders tip-decomposition computationally very expensive. Furthermore, the inherent sequentiality in peeling only minimum butterfly vertices makes derived parallel algorithms prone to heavy synchronization. In this paper, we propose a novel parallel tip-decomposition algorithm – REfine CoarsE-grained Independent Tasks (RECEIPT) that relaxes the sequential restrictions on peeling by partitioning the vertices into multiple independent subsets that can be concurrently peeled. This enables RECEIPT to simultaneously achieve a high degree of parallelism and dramatic reduction in synchronizations. Further, RECEIPT employs a hybrid peeling strategy along with other optimizations that drastically reduce the amount of wedge exploration and execution time. We perform detailed experimental evaluation of RECEIPT on a shared-memory multicore server. It can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms – achieving up to 1100x and 64x reduction in the number of thread synchronizations and traversed wedges, respectively. Using 36 threads, RECEIPT can provide up to 17.1x self-relative speedup. Our implementation of RECEIPT is available at https://github.com/kartiklakhotia/RECEIPT.

Symmetric Continuous Subgraph Matching with Bidirectional Dynamic Programming [Download Paper] Seunghwan Min (Seoul National University), Sung Gwan Park (Seoul National University), Kunsoo Park (Seoul National University), Dora Giammarresi (Universita Roma Tor Vergata), Giuseppe F. Italiano (LUISS University), Wook-Shin Han (POSTECH) In many real datasets such as social media streams and cyber data sources, graphs change over time through a graph update stream of edge insertions and deletions. Detecting critical patterns in such dynamic graphs plays an important role in various application domains such as fraud detection, cyber security, and recommendation systems for social networks. Given a dynamic data graph and a query graph, the continuous subgraph matching problem is to find all positive matches for each edge insertion and all negative matches for each edge deletion. The state-of-the-art algorithm TurboFlux uses a spanning tree of a query graph for filtering. However, using the spanning tree may have a low pruning power because it does not take into account all edges of the query graph. In this paper, we present a symmetric and much faster algorithm SymBi which maintains an auxiliary data structure based on a directed acyclic graph instead of a spanning tree, which maintains the intermediate results of bidirectional dynamic programming between the query graph and the dynamic graph. Extensive experiments with real and synthetic datasets show that SymBi outperforms the state-of-the-art algorithm by up to three orders of magnitude in terms of the elapsed time.

Massively Parallel Algorithms for Personalized PageRank [Download Paper] Guanhao Hou (The Chinese University of Hong Kong), Xingguang Chen (The Chinese University of Hong Kong), Sibo Wang (The Chinese University of Hong Kong), Zhewei Wei (Renmin University of China) Personalized PageRank (PPR) has wide applications in search engines, social recommendations, community detection, and so on. Nowadays, graphs are becoming massive and many IT companies need to deal with large graphs that cannot be fitted into the memory of most commodity servers. However, most existing state-of-the-art solutions for PPR computation only work for single-machines and are inefficient for the distributed framework since such solutions either {\em (i)} result in an excessively large number of communication rounds, or {\em (ii)} incur high communication costs in each round.
Motivated by this, we present {\em Delta-Push}, an efficient framework for single-source and top-$k$ PPR queries in distributed settings. Our goal is to reduce the number of rounds while guaranteeing that the load, i.e., the maximum number of messages an executor sends or receives in a round, can be bounded by the capacity of each executor. We first present a non-trivial combination of a redesigned parallel push algorithm and the Monte-Carlo method to answer single-source PPR queries. The solution uses pre-sampled random walks to reduce the number of rounds for the push algorithm. Theoretical analysis under the {\em Massively Parallel Computing (MPC)} model shows that our proposed solution bounds the communication rounds to $O(\log{\frac{n^2\log{n}}{\epsilon^2m}})$ under a load of $O(m/p)$, where $m$ is the number of edges of the input graph, $p$ is the number of executors, and $\epsilon$ is a user-defined error parameter. In the meantime, as the number of executors increases to $p' = \gamma \cdot p$, the load constraint can be relaxed since each executor can hold $O(\gamma \cdot m/p')$ messages with invariant local memory. In such scenarios, multiple queries can be processed in batches simultaneously. We show that with a load of $O(\gamma \cdot m/p')$, our Delta-Push can process $\gamma$ queries in a batch with $O(\log{\frac{n^2\log{n}}{\gamma\epsilon^2m}})$ rounds, while other baseline solutions still keep the same round cost for each batch. We further present a new top-$k$ algorithm that is friendly to the distributed framework and reduces the number of rounds required in practice. Extensive experiments show that our proposed solution is more efficient than alternatives.

18Aug

09:00 – 10:30 CESTResearch Session 16: NLP for Databases Chaired by Jacopo Urbani

From Natural Language Processing to Neural Databases [Download Paper] James Thorne (University of Cambridge), Majid Yazdani (Facebook), Marzieh Saeidi (Facebook), Fabrizio Silvestri (Facebook), Sebastian Riedel (), Alon Y Halevy (Facebook) In recent years, neural networks have shown impressive performance gains on long-standing AI problems, such as answering queries from text and machine translation. These advances raise the question of whether neural nets can be used at the core of query processing to derive answers from facts, even when the facts are expressed in natural language. If so, it is conceivable that we could relax the fundamental assumption of database management, namely, that our data is represented as fields of a pre-defined schema. Furthermore, such technology would enable combining information from text, images, and structured data seamlessly. This paper introduces neural databases, a class of systems that use NLP transformers as localized answer derivation engines. We ground the vision in NeuralDB, a system for querying facts represented as short natural language sentences. We demonstrate that recent natural language processing models, specifically transformers, can answer select-project-join queries if they are given a set of relevant facts. However, they cannot scale to non-trivial databases nor answer set-based and aggregation queries. Based on these insights, we identify specific research challenges that are needed to build neural databases. Some of the challenges require drawing upon the rich literature in data management, and others pose new research opportunities to the NLP community. Finally, we show that with preliminary solutions, NeuralDB can already answer queries over thousands of sentences with very high accuracy.

The Case for NLP-Enhanced Database Tuning: Towards Tuning Tools that "Read the Manual" [Download Paper] Immanuel Trummer (Cornell) A large body of knowledge on database tuning is available in the form of natural language text. We propose to leverage natural language processing (NLP) to make that knowledge accessible to automated tuning tools. We describe multiple avenues to exploit NLP for database tuning, and outline associated challenges and opportunities. As a proof of concept, we describe a simple prototype system that exploits recent NLP advances to mine tuning hints from Web documents. We show that mined tuning hints improve performance of MySQL and Postgres on TPC-H, compared to the default configuration.

Quality of Sentiment Analysis Tools: The Reasons of Inconsistency [Download Paper] Wissam Mammar Kouadri (University of Paris), Mourad Ouziri (University of Paris), Salima Benbernou (Université Paris Descartes), Karima Echihabi (Mohammed VI Polytechnic University), Themis Palpanas (University of Paris), Iheb Benamor (imba consulting) In this paper, we present a comprehensive study that evaluates six state-of-the-art sentiment analysis tools on five public datasets, based on the quality of predictive results in the presence of semantically equivalent documents, i.e., how consistent existing tools are in predicting the polarity of documents based on paraphrased text. We observe that sentiment analysis tools exhibit intra-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents by the same tool, and inter-tool inconsistency, which is the prediction of different polarity for semantically equivalent documents across different tools. We introduce a heuristic to assess the data quality of an augmented dataset and a new set of metrics to evaluate tool inconsistencies. Our results indicate that tool inconsistencies is still an open problem, and they point towards promising research directions and accuracy improvements that can be obtained if such inconsistencies are resolved.

CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs [Download Paper] Abdelghny Orogat (Carleton University), Isabelle Liu (Carleton University), Ahmed El-Roby (Carleton University) Recently, there has been an increase in the number of knowledge graphs that can be only queried by experts. However, describing questions using structured queries is not straightforward for non-expert users who need to have sufficient knowledge about both the vocabulary and the structure of the queried knowledge graph, as well as the syntax of the structured query language used to describe the user's information needs. The most popular approach introduced to overcome the aforementioned challenges is to use natural language to query these knowledge graphs. Although several question answering benchmarks can be used to evaluate question-answering systems over a number of popular knowledge graphs, choosing a benchmark to accurately assess the quality of a question answering system is a challenging task.
In this paper, we introduce CBench, an extensible, and more informative benchmarking suite for analyzing benchmarks and evaluating question answering systems. CBench can be used to analyze existing benchmarks with respect to several fine-grained linguistic, syntactic, and structural properties of the questions and queries in the benchmark. We show that existing benchmarks vary significantly with respect to these properties deeming choosing a small subset of them unreliable in evaluating QA systems. Until further research improves the quality and comprehensiveness of benchmarks, CBench can be used to facilitate this evaluation using a set of popular benchmarks that can be augmented with other user-provided benchmarks. CBench not only evaluates a question answering system based on popular single-number metrics but also gives a detailed analysis of the linguistic, syntactic, and structural properties of answered and unanswered questions to better help the developers of question answering systems to better understand where their system excels and where it struggles.

Improving Information Extraction from Visually Rich Documents using Visual Span Representations [Download Paper] Ritesh Sarkhel (Ohio State University), Arnab Nandi (The Ohio State University) Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) tasks perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis – a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is two-fold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human-labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. It identifies the visual span(s) containing a named entity by leveraging this learned representation followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score.

09:00 – 10:30 CESTResearch Session 17: Machine Learning for Databases Chaired by Tim Kraska

Explaining Inference Queries with Bayesian Optimization [Download Paper] Brandon Lockhart (Simon Fraser University), Jinglin Peng (Simon Fraser University), Weiyuan Wu (Simon Fraser University), Jiannan Wang (Simon Fraser University), Eugene Wu (Columbia University) Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO - a technique for finding the global optimum of a black-box function - is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation compared to those found by the state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at https://github.com/sfu-db/BOExplain.

An Inquiry into Machine Learning-based Automatic Configuration Tuning Services on Real-World Database Management Systems [Download Paper] Dana M Van Aken (Carnegie Mellon University), Dongsheng Yang (Princeton University), Sebastien Brillard (Societe Generale), Ari Fiorino (Carnegie Mellon University), Bohan Zhang (OtterTune), Christian Billian (Societe Generale), Andrew Pavlo (Carnegie Mellon University) Modern database management systems (DBMS) expose dozens of configurable knobs that control their runtime behavior. Setting these knobs correctly for an application's workload can improve the performance and efficiency of the DBMS. But because of their complexity, tuning a DBMS often requires considerable efforts from experienced database administrators (DBAs). Recent work on automated tuning methods using machine learning (ML) have shown to achieve better performance compared with expert DBAs. These ML-based methods, however, were evaluated on synthetic workloads with limited tuning opportunities, and thus it is unknown whether they provide the same benefit in a production environment.
To better understand ML-based tuning, we conducted a thorough evaluation of ML-based DBMS knob tuning methods on an enterprise database application. We use the OtterTune tuning service to compare three state-of-the-art ML algorithms on an Oracle installation with a real workload trace. Our results with OtterTune show that these algorithms generate knob configurations that improve performance by up to 45% over enterprise-grade configurations. We also identify several deployment and measurement issues that we encountered in our study that were overlooked by previous research in automated DBMS tuning services.

CGPTuner: a Contextual Gaussian Process Bandit Approach for the Automatic Tuning of IT Configurations Under Varying Workload Conditions [Download Paper] Stefano Cereda (Politecnico di Milano), Stefano Valladares (Akamas), Paolo Cremonesi (Politecnico di Milano), Stefano Doni (Akamas) Properly selecting the configuration of a database management system (DBMS) is essential to increase performance and reduce costs. However, the task is astonishingly tricky due to a large number of tunable configuration parameters and their inter-dependencies. Also, the optimal configuration depends upon the workload to which the DBMS is exposed. To extract the full potential of a DBMS, we must also consider the entire IT stack on which the DBMS is running, comprising layers like the Java virtual machine, the operating system and the physical machine. Each layer offers a multitude of parameters that we should take into account. The available parameters vary as new software versions are released, making it impractical to rely on historical knowledge bases. We present a novel tuning approach for the DBMS configuration autotuning that quickly finds a well-performing configuration of an IT stack and adapts it to workload variations, without having to rely on a knowledge base. We evaluate the proposed approach using the Cassandra and MongoDB DBMSs, showing that it adjusts the suggested configuration to the observed workload and is portable across different IT applications. We try to minimise the memory consumption without increasing the response time, showing that the proposed approach reduces the response time and increases the memory requirements only under heavy-load conditions, reducing it again when the load decreases.

Database Technology for the Masses: Sub-Operators as First-Class Entities [Download Paper] Maximilian Bandle (TUM), Jana Giceva (TU Munich) Relational databases have developed a wealth of technology over decades that has been successfully tried and tested in many settings and use cases. Yet, often most of it is left aside in the pursuit of performance (e.g., NoSQL) or new functionality (e.g., graph data, machine learning). In this paper, we argue that a wide range of techniques readily available in databases are crucial to tackling the challenges the IT industry faces in managing hardware trends, growing workloads, and the overall complexity of a quickly changing application and platform landscape.
However, to be truly useful, these techniques must be freed from the legacy component of database engines: relational operators. Therefore, we argue that in order to make databases a more flexible platform and extend their functionality to new data types and operations, it requires to expose a lower level of abstraction: instead of working with SQL, database engines should compile, optimize, and run a collection of sub-operators for manipulating and managing data, offering them as an external interface. In this paper, we discuss the advantages of doing so, provide a first list of such sub-operators, and show how they can be used in practice.

Data Acquisition for Improving Machine Learning Models [Download Paper] Yifan Li (York University), Xiaohui Yu (York University), Nick Koudas (University of Toronto) The vast advances in Machine Learning over the last ten years have been powered by the availability of suitably prepared data for training purposes. The future of ML-enabled enterprise hinges on data. As such there is already a vibrant market offering data annotation services to tailor sophisticated ML models.
In this paper, inspired by the recent vision of online data markets and associated market designs, we present research on the practical problem of obtaining data in order to improve the accuracy of ML models. We consider an environment in which consumers query for data to enhance the accuracy of their models and data providers that possess data and make them available for training purposes. We first formalize this interaction process laying out the suitable framework and associated parameters for data exchange. We then propose data acquisition strategies that consider a trade-off between exploration during which we obtain data to learn about the data distribution of a provider and exploitation during which we optimize our data inquiries utilizing the gained knowledge. We propose two algorithms for this purpose. First, the estimation and allocation (EA) strategy during which we utilize queries to estimate the utility of various predicates while learning about the data distribution of the provider; then we proceed with the allocation step in which we utilize these learned utilities to base our data acquisition decisions. The second algorithmic proposal, named Sequential Predicate Selection (SPS) utilizes a sampling strategy to explore the data distribution of the provider, adaptively investing more resources to parts of the data space that are statistically more promising to improve overall model accuracy.
We present a detailed experimental evaluation of our proposals utilizing a variety of ML models and associated real data sets exploring all applicable parameters of interest. Our results indicate the relative benefits of the proposed algorithms. Depending on the models we train and the associated learning tasks we identify trade-offs and highlight the relative benefits of each algorithm to further optimize model accuracy.

09:00 – 10:30 CESTResearch Session 18: Query Optimization Chaired by Sven Groppe

Accelerating Approximate Aggregation Queries with Expensive Predicates [Download Paper] Daniel Kang (Stanford University), John Guibas (Stanford University), Peter D Bailis (Stanford University), Tatsunori Hashimoto (Stanford), Yi Sun (University of Chicago), Matei Zaharia (Stanford and Databricks) Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard techniques to accelerate these queries in the form of approximate query processing are not applicable because they assume the result of the predicates are available ahead of time, i.e., as structured records in a database. Furthermore, recent work using cheap approximations (i.e., proxies) do not support aggregation queries with predicates.
To accelerate aggregation queries with expensive predicates, we develop a novel stratified sampling algorithm that leverages proxies (ABae). ABae must account for the key challenge that we may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, we propose a two-stage sampling algorithm that estimates optimal allocation for a fixed sampling budget and then samples according to this allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms on baselines on six real-world datasets, reducing labeling costs by up to 2.3x.

Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing [Download Paper] Zuozhi Wang (U C IRVINE), Kai Zeng (Alibaba Group), Botong Huang (Alibaba), Wei Chen (Alibaba), Xiaozong Cui (Alibaba), Bo Wang (Alibaba), Ji Liu (Alibaba), Liya Fan (Alibaba), Dachuan Qu (Alibaba), zhenyu hou (Alibaba Group), Tao Guan (Alibaba Group), Chen Li (UC Irvine), Jingren Zhou (Alibaba Group) Incremental processing is widely-adopted in many applications, ranging from incremental view maintenance, stream computing, to recently emerging progressive data warehouse and intermittent query processing. Despite many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper, we develop a novel cost-based optimizer framework, called Tempura, for optimizing incremental data processing. We propose an incremental query planning model called TIP based on the concept of time-varying relations, which can formally model incremental processing in its most general form. We give a full specification of Tempura, which can not only unify various existing techniques to generate an optimal incremental plan, but also allow the developer to add their rewrite rules. We study how to explore the plan space and search for an optimal incremental plan. We evaluate Tempura in various incremental processing scenarios to show its effectiveness and efficiency.

Aggregated Deletion Propagation for Counting Conjunctive Query Answers [Download Paper] Xiao Hu (Duke University), Shouzhuo Sun (Duke University), Shweta Patwa (Duke University), Debmalya Panigrahi (Duke University), Sudeepa Roy (Duke University, USA) We investigate the computational complexity of minimizing the source side-effect in order to remove a given number of tuples from the output of a conjunctive query. This is a variant of the well-studied deletion propagation problem, the difference being that we are interested in removing a smallest subset of input tuples to remove a given number of output tuples while deletion propagation focuses on removing a specific output tuple. We call this the Aggregated Deletion Propagation problem. We completely characterize the poly-time solvability of this problem for arbitrary conjunctive queries without self-joins. This includes a poly-time algorithm to decide solvability, as well as an exact structural characterization of NP-hard instances. We also provide a practical algorithm for this problem (a heuristic for NPhard instances) and evaluate its experimental performance on real and synthetic datasets.

Adaptive Code Generation for Data-Intensive Analytics [Download Paper] Wangda Zhang (Columbia University), Junyoung Kim (Columbia University), Kenneth A Ross (Columbia University), Eric Sedlar (Oracle), Lukas Stadler (Oracle Labs) Modern database management systems employ sophisticated query optimization techniques that enable the generation of efficient plans for queries over very large data sets. A variety of other applications also process large data sets, but cannot leverage database-style query optimization for their code. We therefore identify an opportunity to enhance an open-source programming language compiler with database-style query optimization. Our system dynamically generates execution plans at query time, and runs those plans on chunks of data at a time. Based on feedback from earlier chunks, alternative plans might be used for later chunks. The compiler extension could be used for a variety of data-intensive applications, allowing all of them to benefit from this class of performance optimizations.

DISK: A Distributed Framework for Single-Source SimRank with Accuracy Guarantee [Download Paper] Yue Wang (Shenzhen Institute of Computing Sciences, Shenzhen University.), Ruiqi Xu (University of Edinburgh), Zonghao Feng (Hong Kong University of Science and Technology), Yulin Che (Hong Kong University of Science and Technology), Lei Chen (Hong Kong University of Science and Technology), Qiong Luo (Hong Kong University of Science and Technology), Rui Mao (Shenzhen University) Measuring similarities among different nodes is important in graph analysis. SimRank is one of the most popular similarity measures. Given a graph $G(V,E)$ and a source node $u$, a single-source SimRank query returns the similarities between $u$ and each node $v \in V$. This type of query is often used in link prediction, personalized recommendation and spam detection. While dealing with a large graph is beyond the ability of a single machine due to its limited memory and computational power, it is necessary to process single-source SimRank queries in a distributed environment, where the graph is partitioned and distributed across multiple machines. However, most current solutions are based on shared-memory model, where the whole graph is loaded into a shared memory and all processors can access the graph randomly. It is difficult to deploy such algorithms on shared-nothing model. In this paper, we present DISK, a distributed framework for processing single-source SimRank queries. DISK follows the linearized formulation of SimRank, and consists of offline and online phases. In the offline phase, a tree-based method is used to estimate the diagonal correction matrix of SimRank accurately, and in the online phase, single-source similarities are computed iteratively. Under this framework, we propose different optimization techniques to boost the indexing and queries. DISK guarantees both accuracy and parallel scalability, which distinguishes itself from existing solutions. Its accuracy, efficiency, parallel scalability and scalability are also verified by extensive experimental studies. The experiments show that DISK scales up to graphs of billions of nodes and edges, and answers online queries within seconds, while ensuring the accuracy bounds.

09:00 – 10:30 CESTResearch Session 19: Data Streams I Chaired by Fabian Hueske

In the Land of Data Streams where Synopses are Missing, One Framework to Bring Them All [Download Paper] Rudi Poepsel-Lemaitre (Technische Universität Berlin), Martin Kiefer (TU Berlin), Joscha von Hein (TU Berlin), Jorge Arnulfo Quiane Ruiz (TU Berlin), Volker Markl (Technische Universität Berlin) In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first class citizens, i.e., as pipeline operators. Synopses’ implementation is upon users. This is mainly because of the diversity of synopses, which makes a unified implementation difficult. We present Condor, a framework that supports synopses as first class citizens. Condor facilitates the specification and processing of synopsis-based streaming jobs while hiding all internal processing details. Condor’s key component is its model that represents synopses as a particular case of windowed aggregate functions. An inherent divide and conquer strategy allows Condor to efficiently distribute the computation, allowing for high-performance and linear scalability. Our evaluation shows that Condor outperforms existing approaches by up to a factor of 75x and that it scales linearly with the number of cores.

PR-Sketch: Monitoring Per-key Aggregation of Streaming Data with Nearly Full Accuracy [Download Paper] Qun Huang (Peking University), Sa Wang (Chinese Academy of Sciences) Computing per-key aggregation is indispensable in streaming data analysis formulated as two phases, an update phase and a recovery phase. As the size and speed of data streams rise, accurate per-key information is useful in many applications like anomaly detection, attack prevention, and online diagnosis. Even though many algorithms have been proposed for per-key aggregation in stream processing, their accuracy guarantees only cover a small portion of keys. In this paper, we aim to achieve nearly full accuracy with limited resource usage. We follow the line of sketch-based techniques. We observe that existing methods suffer from high errors for most keys. The reason is that they track keys by complicated mechanism in the update phase and simply calculate per-key aggregation from some specific counter in the recovery phase. Therefore, we present PR-Sketch, a novel sketching design to address the two limitations. PR-Sketch builds linear equations between counter values and per-key aggregations to improve accuracy, and records keys in the recovery phase to reduce resource usage in the update phase. We also provide an extension called fast PR-Sketch to improve processing rate further. We derive space complexity, time complexity, and guaranteed error probability for both PR-Sketch and fast PRSketch. We conduct trace-driven experiments under 100K keys and 1M items to compare our algorithms with multiple state-of-the-art methods. Results demonstrate the resource efficiency and nearly full accuracy of our algorithms.

Ananke: A Streaming Framework for Live Forward Provenance [Download Paper] Dimitris Palyvos-Giannas (Chalmers University of Technology), Bastian Havers (Chalmers University of Technology and Volvo Car Corporation), Marina Papatriantafilou (Chalmers University of Technology), Vincenzo Gulisano (Chalmers University of Technology) Data streaming enables online monitoring of large and continuous event streams in Cyber-Physical Systems (CPSs). In such scenarios, fine-grained backward provenance tools can connect streaming query results to the source data producing them, allowing analysts to study the dependency/causality of CPS events. While CPS monitoring commonly produces many events, backward provenance does not help prioritize event inspection since it does not specify if an event's provenance could still contribute to future results.
To cover this gap, we introduce Ananke, a framework to extend any fine-grained backward provenance tool and deliver a live bipartite graph of fine-grained forward provenance. With Ananke, analysts can prioritize the analysis of provenance data based on whether such data is still potentially being processed by the monitoring queries. We prove our solution is correct, discuss multiple implementations, including one leveraging streaming APIs for parallel analysis, and show Ananke results in small overheads, close to those of existing tools for fine-grained backward provenance.

Optimization of Threshold Functions over Streams [Download Paper] Walter Cai (University of Washington), Philip Bernstein (Microsoft Research), Wentao Wu (Microsoft Research), Badrish Chandramouli (Microsoft Research) A common stream processing application is alerting, where the data stream management system (DSMS) continuously evaluates a threshold function over incoming streams. If the threshold is crossed, the DSMS raises an alarm. The threshold function is often calculated over two or more streams, such as combining temperature and humidity readings to determine if moisture will form on a machine and therefore cause it to malfunction. This requires taking a temporal join across the input streams. We show that for the broad class of functions called quasiconvex functions, the DSMS needs to retain very few tuples per-data-stream for any given time interval and still never miss an alarm. This surprising result yields a large memory savings during normal operation. That savings is also important if one stream fails, since the DSMS would otherwise have to cache all tuples in other streams until the failed stream recovers. We prove our algorithm is optimal and provide experimental evidence that validates its substantial memory savings.

TRACE: Real-time Compression of Streaming Trajectories in Road Networks [Download Paper] Tianyi Li (Aalborg Univeristy), Lu Chen (Zhejiang University), Christian S Jensen (Aalborg University), Torben Bach Pedersen (Aalborg University) The deployment of vehicle location services generates increasingly massive vehicle trajectory data, which incurs high storage and transmission costs. A range of studies target offline compression to reduce the storage cost. However, to enable online services such as real-time traffic monitoring, it is attractive to also reduce transmission costs by being able to compress streaming trajectories in real-time. Hence, we propose a framework called TRACE that enables compression, transmission, and querying of network-constrained streaming trajectories in a fully online fashion. We propose a compact two-stage representation of streaming trajectories: a speed-based representation removes redundant information, and a multiple-references based referential representation exploits subtrajectory similarities. In addition, the online referential representation is extended with reference selection, deletion and rewriting functions that further improve the compression performance. An efficient data transmission scheme is provided for achieving low transmission overhead. Finally, indexing and filtering techniques support efficient real-time range queries over compressed trajectories. Extensive experiments with real-life and synthetic datasets evaluate the different parts of TRACE, offering evidence that it is able to outperform the existing representative methods in terms of both compression ratio and transmission cost.

09:00 – 10:30 CESTResearch Session 20: Graph Management I Chaired by Matteo Lissandrini

Collective Influence Maximization for Multiple Competing Products with an Awareness-to-Influence Model [Download Paper] Dimitris Tsaras (HKUST), George Trimponias (Amazon Search), Lefteris Ntaflos (HKUST), Dimitris Papadias (HKUST) Influence maximization (IM) is a fundamental task in social network analysis. Typically, IM aims at selecting a set of seeds for the network that influences the maximum number of individuals. Motivated by practical applications, in this paper we focus on an IM variant, where the owner of multiple competing products wishes to select seeds for each product so that the collective influence across all products is maximized. To capture the competing diffusion processes, we introduce an Awareness-to-Influence (AtI) model. In the first phase, awareness about each product propagates in the social graph unhindered by other competing products. In the second phase, a user adopts the most preferred product among those encountered in the awareness phase. To compute the seed sets, we propose GCW, a game-theoretic framework that views the various products as agents, which compete for influence in the social graph and selfishly select their individual strategy. We show that AtI exhibits monotonicity and submodularity; importantly, GCW is a monotone utility game. This allows us to develop an efficient best-response algorithm, with quality guarantees on the collective utility. Our experimental results suggest that our methods are effective, efficient, and scale well to large social networks.

Finding Group Steiner Trees in Graphs with both Vertex and Edge Weights [Download Paper] Yahui Sun (National University of Singapore), Xiaokui Xiao (National University of Singapore), Bin Cui (Peking University), Saman Halgamuge (University of Melbourne), Theodoros Lappas (Stevens Institute of Technology), Jun Luo (Nanyang Technological University) Given an undirected graph and a number of vertex groups, the group Steiner tree problem is to find a tree such that (i) this tree contains at least one vertex in each vertex group; and (ii) the sum of vertex and edge weights in this tree is minimized. Solving this problem is useful in various scenarios, ranging from social networks to knowledge graphs. Most existing work focuses on solving this problem in vertex-unweighted graphs, and not enough work has been done to solve this problem in graphs with both vertex and edge weights. Here, we develop several algorithms to address this issue. Initially, we extend two algorithms from vertex-unweighted graphs to vertex- and edge-weighted graphs. The first one has no approximation guarantee, but often produces good solutions in practice. The second one has an approximation guarantee of |Γ| − 1, where |Γ| is the number of vertex groups. Since the extended (|Γ| − 1)- approximation algorithm is too slow when all vertex groups are large, we develop two new (|Γ| − 1)-approximation algorithms that overcome this weakness. Furthermore, by employing a dynamic programming approach, we develop another (|Γ|−ℎ+1)-approximation algorithm, where ℎ is a parameter between 2 and |Γ|. Experiments show that, while no algorithm is the best in all cases, our algorithms considerably outperform the state of the art in many scenarios.

Shortest Paths and Centrality in Uncertain Networks [Download Paper] Arkaprava Saha (Nanyang Technological University), Ruben Brokkelkamp (CWI Amsterdam), Yllka Velaj (University of Vienna), Arijit Khan (Nanyang Technological University), Francesco Bonchi (ISI Foundation, Turin) Computing the shortest path between a pair of nodes is a fundamental graph primitive, which has critical applications in vehicle routing, finding functional pathways in biological networks, survivable network design, among many others. In this work, we study shortest-path queries over uncertain networks, i.e., graphs where every edge is associated with a probability of existence. We show that, for a given path, it is #P-hard to compute the probability of it being the shortest path, and we also derive other interesting properties highlighting the complexity of computing the Most Probable Shortest Paths (MPSPs). We thus devise sampling-based efficient algorithms, with end-to-end accuracy guarantees, to compute the MPSP. As a concrete application, we show how to compute a novel concept of betweenness centrality in an uncertain graph using MPSPs. Our thorough experimental results and rich real-world case studies on sensor networks and brain networks validate the effectiveness, efficiency, scalability, and usefulness of our solution.

Hierarchical Core Maintenance on Large Dynamic Graphs [Download Paper] Zhe Lin (East China Normal University), Fan Zhang (Guangzhou University), Xuemin Lin (University of New South Wales), Wenjie Zhang (University of New South Wales), Zhihong Tian (Guangzhou University) The model of 𝑘-core and its decomposition have been applied in various areas, such as social networks, the world wide web, and biology. A graph can be decomposed into an elegant 𝑘-core hierarchy to facilitate cohesive subgraph discovery and network analysis. As many real-life graphs are fast evolving, existing works proposed efficient algorithms to maintain the coreness value of every vertex against structure changes. However, the maintenance of the 𝑘-core hierarchy in existing studies is not complete because the connections among different 𝑘-cores in the hierarchy are not considered. In this paper, we study hierarchical core maintenance which is to compute the 𝑘-core hierarchy incrementally against graph dynamics. The problem is challenging because the change of hierarchy may be large and complex even for a slight graph update. In order to precisely locate the area affected by graph dynamics, we conduct in-depth analyses on the structural properties of the hierarchy, and propose well-designed local update techniques. Our algorithms significantly outperform the baselines on runtime by up to 3 orders of magnitude, as demonstrated on 10 real-world large graphs.

Local Algorithms for Distance-generalized Core Decomposition over Large Dynamic Graphs [Download Paper] Qing Liu (Hong Kong Baptist University), Xuliang Zhu (Hong Kong Baptist University), Xin Huang (Hong Kong Baptist University), Jianliang Xu (Hong Kong Baptist University) The distance-generalized core, also called (k,h)-core, is defined as the maximal subgraph in which every vertex has at least $k$ vertices at distance no longer than h. Compared with k-core, (k,h)-core can identify more fine-grained subgraphs and, hence, is more useful for the applications such as network analysis and graph coloring. The state-of-the-art algorithms for (k,h)-core decomposition are peeling algorithms, which iteratively delete the vertex with the minimum $h$-degree (i.e., the least number of neighbors within h hops). However, they suffer from some limitations, such as low parallelism and incapability of supporting dynamic graphs. To address these limitations, in this paper, we revisit the problem of (k,h)-core decomposition. First, we introduce two novel concepts of pairwise h-attainability index and n-order H-index based on an insightful observation. Then, we thoroughly analyze the properties of n-order H-index and propose a parallelizable local algorithm for (k,h)-core decomposition. Moreover, several optimizations are presented to accelerate the local algorithm. Furthermore, we extend the proposed local algorithm to address the (k,h)-core maintenance problem for dynamic graphs. Experimental studies on real-world graphs show that, compared to the best existing solution, our proposed algorithms can reduce the (k,h)-core decomposition time by 1-3 orders of magnitude and save the maintenance cost by 1-2 orders of magnitude.

11:00 – 12:00 CESTResearch Session 21: User Interaction Chaired by Eleni Tzirita Zacharatou

NOAH: Interactive Spreadsheet Exploration with Dynamic Hierarchical Overviews [Download Paper] Sajjadur Rahman (Megagon Labs), Mangesh Bendre (VISA Research), Yuyang Liu (University of Illinois at Urbana-Champaign), Shichu Zhu (Google LLC), Zhaoyuan Su (University of California, Irvine), Karrie Karahalios (University of Illinois at Urbana-Champaign), Aditya Parameswaran (University of California, Berkeley) Spreadsheet systems are by far the most popular platform for data exploration on the planet, supporting millions of rows of data. However, exploring spreadsheets that are this large via operations such as scrolling or issuing formulae can be overwhelming and error-prone. Users easily lose context and suffer from cognitive and mechanical burdens while issuing formulae on data spanning multiple screens. To address these challenges, we introduce dynamic hierarchical overviews that are embedded alongside spreadsheets. Users can employ this overview to explore the data at various granularities, zooming in and out of the spreadsheet. They can issue formulae over data subsets without cumbersome scrolling or range selection, enabling users to gain a high or low-level perspective of the spreadsheet. An implementation of our dynamic hierarchical overview, NOAH, integrated within DataSpread, preserves spreadsheet semantics and look and feel, while introducing such enhancements. Our user studies demonstrate that NOAH makes it more intuitive, easier, and faster to navigate spreadsheet data compared to traditional spreadsheets like Microsoft Excel and spreadsheet plug-ins like Pivot Table, for a variety of exploration tasks; participants made fewer mistakes in NOAH while being faster in completing the tasks.

Robust Voice Querying with MUVE: Optimally Visualizing Results of Phonetically Similar Queries [Download Paper] Ziyun Wei (Cornell University), Immanuel Trummer (Cornell), Connor Anderson (Cornell University) Recently proposed voice query interfaces translate voice input into SQL queries. Unreliable speech recognition on top of the intrinsic challenges of text-to-SQL translation makes it hard to reliably interpret user input. We present MUVE (Multiplots for Voice quEries), a system for robust voice querying. MUVE reduces the impact of ambiguous voice queries by filling the screen with multiplots, capturing results of phonetically similar queries. It maps voice input to a probability distribution over query candidates, executes a selected subset of queries, and visualizes their results in a multiplot. Our goal is to maximize the probability to show the correct query result. Also, we want to optimize the visualization(e.g., by coloring a subset of likely results) in order to minimize the expected time until users find the correct result. Via a user study, we validate a simple cost model estimating the latter overhead. The resulting optimization problem is NP-hard. We propose an exhaustive algorithm, based on integer programming, as well as a greedy heuristic. As shown in a corresponding user study, MUVE enables users to identify accurate results faster, compared to prior work.

PolyFrame: A Retargetable Query-based Approach to Scaling Dataframes [Download Paper] Phanwadee Sinthong (University of California, Irvine), Michael Carey (UC Irvine) In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a data analytics library, is implemented as a layer on top of Apache AsterixDB, addressing these issues by providing the data scientists' familiar interface, Pandas Dataframe, and transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a large volume of data, the initial version only generated SQL++ queries and only operated against AsterixDB. In this work, we describe a new design that retargets AFrame’s incremental query formation to other query-based database systems, making it more flexible for deployment against other data management systems with composable query languages.

11:00 – 12:00 CESTResearch Session 22: Search Chaired by Constantinos Costa

LES3: Learning-based exact set similarity search [Download Paper] Yifan Li (York University), Xiaohui Yu (York University), Nick Koudas (University of Toronto) Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches on set similarity search utilize either heavy indexing structures, incurring large search costs or indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach, first partitions sets into groups, and then utilizes a light-weight bitmap-like indexing structure, called token-group matrix (TGM), to organize groups and prune out candidates given a query set. In order to optimize pruning using the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning based partitioning approach called L2P and an associated data representation encoding, PTR, to identify the partitions. We conduct extensive experiments on real and synthetic data sets to fully study LES3, establishing the effectiveness and superiority over other applicable approaches.

DeltaPQ: Lossless Product Quantization Code Compression for High Dimensional Similarity Search [Download Paper] Runhui Wang (Rutgers University), (Rutgers University) High dimensional data is ubiquitous and plays an impor-tant role in many applications. However, the size of high dimensional data is usually excessively large. To alleviate this problem, in this paper, we develop novel techniques to compress and search high dimensional data. Speciﬁcally, we ﬁrst apply vector quantization, a classical lossy data com-pression method. It quantizes a high dimensional vector to a sequence of small integers, namely the quantization code. Then, we propose a novel lossless compression algo-rithm, DeltaPQ, to further compress the quantization codes. DeltaPQ organizes the quantization codes in a tree struc-ture and stores the diﬀerences between two quantization codes rather than the original codes. Among the exponential number of possible tree structures, we develop an eﬃcient algorithm, whose time and space complexity are linear to the number of codes, to ﬁnd the one with optimal compres-sion ratio. The approximate nearest neighbor search query can be processed directly on the compressed data with small space overhead in a few bytes. Many similarity measures can be supported, such as inner product, cosine similarity, Eu-clidean distance, and Lp-norm. Experimental results on ﬁve large-scale real-world datasets show that DeltaPQ achieves a compression ratio of up to 5 (and often greater than 2) on the quantization codes whereas the state-of-art general-purpose lossless compression algorithms barely work.

Budget Constrained Interactive Search for Multiple Targets [Download Paper] Xuliang Zhu (Hong Kong Baptist University), Xin Huang (Hong Kong Baptist University), Byron Choi (Hong Kong Baptist University), Jiaxin Jiang (Hong Kong Baptist University), Zhaonian Zou (Harbin Institute of Technology), Jianliang Xu (Hong Kong Baptist University) Interactive graph search leverages human intelligence to categorize target labels in a hierarchy, which are useful for image classification, product categorization, and database search. However, many existing studies of interactive graph search aim at identifying a single target optimally, and suffer from the limitations of asking too many questions and not being able to handle multiple targets. To address these two limitations, in this paper, we study a new problem of Budget constrained Interactive Graph Search for Multiple targets called lBM-IGS-problem. Specifically, given a set of multiple targets T in a hierarchy, and two parameters k and b, the goal is to identify a k-sized set of selections S such that the closeness between selections S and targets T is as small as possible, by asking at most a budget of b questions. We theoretically analyze the updating rules and design a penalty function to capture the closeness between selections and targets. To tackle the kBM-IGS-problem, we develop a novel framework to ask questions using the best vertex with the largest expected gain, which makes a balanced trade-off between target probability and benefit gain. Based on the kBM-IGS framework, we then propose an efficient algorithm STBIS to tackle a special case of the SingleTarget problem. Furthermore, we propose a dynamic programming based method kBM-DP to tackle kBM-IGS. To further improve efficiency, we propose two approximate methods kBM-Topk and kBM-DP+. Extensive experiments on large real-world datasets with ground-truth targets verify both the effectiveness and efficiency of our proposed algorithms.

11:00 – 12:00 CESTResearch Session 23: Special Hardware Chaired by Zsolt Istvan

Efficient Join Algorithms For Large Database Tables in a Multi-GPU Environment [Download Paper] Ran Rui (University of South Florida), Hao Li (University of South Florida), Yi-Cheng Tu (University of South Florida) Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs as a general-purpose parallel computing platform is very promising in processing relational joins. However, join algorithms often need to handle very large input data, which is an issue that was not sufficiently addressed in existing work. Besides, as more and more desktop and workstation platforms support multi-GPU environment, the combined computing capability of multiple GPUs can easily achieve that of a computing cluster. It is worth exploring how join processing would benefit from the adaptation of multiple GPUs. We identify the low rate and complex patterns of data transfer among the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome such challenges, we propose three distinctive designs of multi-GPU join algorithms, namely, the nested loop, global sort-merge and hybrid joins for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate high scalability of our algorithms over data size and significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance as compared to existing join algorithms, with a speedup up to 25X and 2.9X over best known code developed for multi-core CPUs and GPUs respectively.

Improving Execution Efficiency of Just-in-time Compilation based Query Processing on GPUs [Download Paper] Johns Paul (NUS), Bingsheng He (National University of Singapore), Shengliang Lu (National University of Singapore), Chiew Tong Lau (Nanyang Technological University) In recent years, we have witnessed significant efforts to improve the performance of Online Analytical Processing (OLAP) on graphics processing units (GPUs). Most existing studies have focused on improving memory efficiency since memory stalls can play an essential role in query processing performance on GPUs. Motivated by the recent rise of just-in-time (JIT) compilation in query processing, we investigate whether and how we can further improve query processing performance on GPU. Specifically, we study the execution of state-of-the-art JIT compile-based query processing systems. We find that thanks to advanced techniques such as database compression and JIT compilation, memory stalls are no longer the most significant bottleneck. Instead, current JIT compile-based query processing encounters severe under-utilization of GPU hardware due to divergent execution and degraded parallelism arising from resource contention. To address these issues, we propose a JIT compile-based query engine named Pyper to improve GPU utilization during query execution. Specifically, Pyper has two new operators, Shuffle and Segment, for query plan transformation, which can be plugged into a physical query plan in order to reduce divergent execution and resolve resource contention, respectively. To determine the insertion points for these two operators, we present an analytical model that helps insert Shuffle and Segment operators into a query plan in a cost-based manner. Our experiments show that 1) the analytical analysis of divergent execution and resource contention helps to improve the accuracy of the cost model, 2) the techniques proposed in this study help Pyper improve the performance of TPC-H and SSB queries on average by 1.60x and 1.52x, respectively.

Scotch: Generating FPGA-Accelerators for Sketching at Line Rate [Download Paper] Martin Kiefer (TU Berlin), Ilias Poulakis (TU Berlin), Sebastian Bress (TU Berlin), Volker Markl (Technische Universität Berlin) Sketching algorithms are a powerful tool for single-pass data summarization. Their numerous applications include approximate query processing, machine learning, and large-scale network monitoring. In the presence of high-bandwidth interconnects or in-memory data, the throughput of summary maintenance over input data becomes the bottleneck. While FPGAs have shown admirable throughput and energy-efficiency for data processing tasks, developing FPGA accelerators requires a sophisticated hardware design and expensive manual tuning by an expert.
We propose Scotch, a novel system for accelerating sketch maintenance using FPGAs. Scotch provides a domain-specific language for the user-friendly, high-level definition of a broad class of sketching algorithms. A code generator performs the heavy-lifting of hardware description, while an auto-tuning algorithm optimizes the summary size. Our evaluation shows that FPGA accelerators generated by Scotch outperform CPU- and GPU-based sketching by up to two orders of magnitude in terms of throughput and up to a factor of five in terms of energy efficiency.

11:00 – 12:00 CESTResearch Session 24: Efficient Query Processing Chaired by Peter Boncz

Decomposed Bounded Floats for Fast Compression and Queries [Download Paper] Chunwei Liu (University of Chicago), Hao Jiang (University of Chicago), John Paparrizos (University of Chicago), Aaron J Elmore (University of Chicago) Modern data-intensive applications often generate large amounts of low precision float data with a limited range of values. Despite the prevalence of this data category, there is a lack of an effective solution to ingest, store, and analyze such bounded, low-precision, numeric data. To address this gap, we propose Buff, a new compression technique that uses a decomposed columnar storage and encoding methods to provide effective compression, fast ingestion, and high-speed in-situ adaptive query operators with SIMD support.

Phoebe: A Learning-based Checkpoint Optimizer [Download Paper] Yiwen Zhu (Microsoft), Matteo Interlandi (Microsoft), Abhishek Roy (Microsoft), Krishnadhan Das (Microsoft), Hiren Patel (Microsoft), Malay Bag (Facebook), Hitesh Sharma (Google), Alekh Jindal (Microsoft) Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems, hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates being examples of issues that we are facing at Microsoft.
To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans, and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment.
We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimum performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.

FlexPushdownDB: Hybrid Pushdown and Caching in a Cloud DBMS [Download Paper] Yifei Yang (University of California, San Diego), Matt Youill (Burnian), Matthew Woicik (MIT), Yizhou Liu (University of Wisconsin, Madison), Xiangyao Yu (University of Wisconsin-Madison), Marco Serafini (University of Massachusetts Amherst), Ashraf Aboulnaga (QCRI), Michael Stonebraker (MIT) Modern cloud databases adopt a storage-disaggregation architecture that separates the management of computation and storage. A major bottleneck in such an architecture is the network connecting the computation and storage layers. Two solutions have been explored to mitigate the bottleneck: caching and computation pushdown. While both techniques can significantly reduce network traffic, existing DBMSs consider them as orthogonal techniques and support only one or the other, leaving potential performance benefits unexploited. In this paper we present FlexPushdownDB (FPDB), an OLAP cloud DBMS prototype that supports fine-grained hybrid query execution to combine the benefits of caching and computation pushdown in a storage-disaggregation architecture. We build a hybrid query executor based on a new concept called separable operators to combine the data from the cache and results from the pushdown processing. We also propose a novel Weighted-LFU cache replacement policy that takes into account the cost of pushdown computation. Our experimental evaluation on the Star Schema Benchmark shows that the hybrid execution outperforms both the conventional caching-only architecture and pushdown-only architecture by 2.2×. In the hybrid architecture, our experiments show that Weighted-LFU can outperform the baseline LFU by 37%.

11:00 – 12:00 CESTResearch Session 25: Data Synthesis and Labeling Chaired by Damianos Chatziantoniou

Auto-Pipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search [Download Paper] Jnwen Yang (The UNIVERSITY of CHICAGO), Yeye He (Microsoft Research), Surajit Chaudhuri (Microsoft) Recent work have made significant progress in helping users to automate single data preparation steps, such as string-transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). We in this work propose to automate multiple such steps end-to-end, by synthesizing complex data-pipelines with both string-transformations and table-manipulation operators.
We propose a novel by-target-schema paradigm that allows users to easily specify a desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target-schema, users would provide input tables (e.g., csv or json files), and point us to a "target table" (e.g., an existing database table or BI dashboard) to demonstrate how the output from a desired pipeline on the given input should schematically "look like". While such a specification is easy for users to provide, it is fuzzy in nature, making the problem seemingly under-specified. Our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space of feasible solutions and make successful synthesis possible. We develop a \sj{} system that leverages search and deep reinforcement-learning (DRL) based algorithms that learns to synthesize pipelines via "self-play" on pipelines authored by human experts (reminiscent of self-play learning in Atari and AlphaGo). Experiments on a large number of real pipelines (extracted from Jupyter Notebooks on GitHub and commercial vendors) suggest that \sj{} can successfully synthesize 60-70\% of these complex pipelines with up to 10 pipeline steps in 10-20 seconds.

Kamino: Constraint-Aware Differentially Private Data Synthesis [Download Paper] Chang Ge (University of Waterloo), Shubhankar Mohapatra (University of Waterloo), Xi He (University of Waterloo), Ihab F Ilyas (U. of Waterloo) Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but they fail in keeping one of the most fundamental data properties of the structured data -- the underlying correlations and dependencies among tuples and attributes (i.e., the structure of the data). This structure is often expressed as integrity and schema constraints, or with a probabilistic generative process. As a result, the synthesized data is not useful for any downstream tasks that require this structure to be preserved.
This work presents Kamino, a data synthesis system to ensure differential privacy and to preserve the structure and correlations present in the original dataset. Kamino takes as input of a database instance, along with its schema (including integrity constraints), and produces a synthetic database instance with differential privacy and structure preservation guarantees. We empirically show that while preserving the structure of the data, Kamino achieves comparable and even better usefulness in applications of training classification models and answering marginal queries than the state-of-the-art methods of differentially private data synthesis.

Inspector Gadget: A Data Programming-based Labeling System for Industrial Images [Download Paper] Geon Heo (KAIST), Yuji Roh (KAIST), Seonghyeon Hwang (KAIST), Dayun Lee (KAIST), Steven Whang (KAIST) As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed where only a small portion is problematic (e.g., identifying defects on a surface). Since manual labeling these images is expensive, weak supervision is an attractive alternative where the idea is to generate weak labels that are not perfect, but can be produced at scale. Data programming is a recent paradigm in this category where it uses human knowledge in the form of labeling functions and combines them into a generative model. Data programming has been successful in applications based on text or structured data and can also be applied to images usually if one can find a way to convert them into structured data. In this work, we expand the horizon of data programming by directly applying it to images without this conversion, which is a common scenario for industrial applications. We propose Inspector Gadget, an image labeling system that combines crowdsourcing, data augmentation, and data programming to produce weak labels at scale for image classification. We perform experiments on real industrial image datasets and show that Inspector Gadget obtains better performance than other weak-labeling techniques: Snuba, GOGGLES, and self-learning baselines using convolutional neural networks (CNNs) without pre-training.

11:00 – 12:00 CESTResearch Session 26: Deep Learning I Chaired by Hazar Harmouch

LANCET: Labeling Complex Data at Scale [Download Paper] Huayi Zhang (WPI), Lei Cao (MIT), Samuel Madden (MIT), Elke A Rundensteiner (Worcester Polytechnic Institute) Cutting-edge machine learning techniques such as deep learning are label thirsty, often requiring millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including the following research questions: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. These three questions are not only each challenging in their own right, but they also correspond to tightly interdependent problems. Yet existing techniques from weak supervision to active learning provide at best isolated solutions to a subset of these challenges. In this work, we propose the first approach, called LANCET, that successfully addresses all three challenges in an integrated framework. LANCET is based on a solid theoretical foundation characterizing the properties that the labeled dataset must satisfy to train an effective prediction model, namely the Covariate-shift and the Continuity conditions. First, guided by the Covariate-shift condition, LANCET maps raw input data into a semantic feature space, where an unlabeled object is expected to share the same label with its near-by labeled neighbor. Next, guided by the Continuity condition, LANCET selects objects for labeling, aiming to ensure that unlabeled objects always have some sufficiently close labeled neighbors. These two strategies jointly maximize the accuracy of the automatically produced labels and the prediction accuracy of the machine learning models trained on these labels. Lastly, LANCET uses a distribution matching network to verify whether both the Covariate-shift and Continuity conditions hold, in which case it would be safe to terminate the labeling process. Our experiments on diverse public data sets demonstrate that LANCET outperforms the state-of-the-art methods from Snuba to GOGGLES and other baselines by a large margin -- up to 30 percentage points increase in accuracy.

Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics [Download Paper] Daniel Kang (Stanford University), Ankit Mathur (Stanford University), Teja Veeramacheneni (Stanford University), Peter D Bailis (Stanford University), Matei Zaharia (Stanford and Databricks) While deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in visual analytics systems on modern accelerators. Through a novel measurement study, we show that the preprocessing of data (e.g., decoding, resizing) can be the bottleneck in many visual analytics systems on modern hardware.
To address the bottleneck of preprocessing, we introduce two optimizations for end-to-end visual analytics systems. First, we introduce novel methods of achieving accuracy and throughput trade-offs by using natively present, low-resolution visual data. Second, we develop a runtime engine for efficient visual DNN inference. This runtime engine a) efficiently pipelines preprocessing and DNN execution for inference, b) places preprocessing operations on the CPU or GPU in a hardware- and input-aware manner, and c) efficiently manages memory and threading for high throughput execution. We implement these optimizations in a novel system, Smol, and evaluate Smol on eight visual datasets. We show that its optimizations can achieve up to 5.9x end-to-end throughput improvements at a fixed accuracy over recent work in visual analytics.

EdgeDIPN: a Unified Deep Intent Prediction Network Deployed at the Edge [Download Paper] Long Guo (Alibaba inc.), Lifeng Hua (Alibaba Group), Rongfei Jia (Alibaba Group), Fei Fang (Alibaba Group), Binqiang Zhao (Alibaba), Bin Cui (Peking University) With the rapid growth of e-commerce in recent years, e-commerce platforms are becoming a primary place for people to find, compare and ultimately purchase products. To improve online shopping experience for consumers and increase sales for sellers, it is important to understand user intent accurately and be notified of its change timely. In this way, the right information could be offered to the right person at the right time. To achieve this goal, we propose a unified deep intent prediction network, named EdgeDIPN, which is deployed at the edge, i.e., mobile device, and able to monitor multiple user intent with different granularity simultaneously in real-time. We propose to train EdgeDIPN with multi-task learning, by which EdgeDIPN can share representations between different tasks for better performance and saving edge resources in the meantime. In particular, we propose a novel task-specific attention mechanism which enables different tasks to pick out the most relevant features from different data sources. To extract the shared representations more effectively, we utilize two kinds of attention mechanisms, where the multi-level attention mechanism tries to identify the important actions within each data source and the inter-view attention mechanism learns the interactions between different data sources. In the experiments conducted on a large-scale industrial dataset, EdgeDIPN significantly outperforms the baseline solutions. Moreover, EdgeDIPN has been deployed in the operational system of Alibaba. Online A/B testing results in several business scenarios reveal the potential of monitoring user intent in real-time. To the best of our knowledge, EdgeDIPN is the first full-fledged real-time user intent understanding center deployed at the edge and serving hundreds of millions of users in a large-scale e-commerce platform.

11:00 – 12:00 CESTResearch Session 27: Graph Processing I Chaired by Wim Martens

EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal in GPUs [Download Paper] Seung Won Min (University of Illinois at Urbana-Champaign), Vikram Sharma Mailthody (University of Illinois at Urbana-Champaign), Zaid Qureshi (University of Illinois at Urbana-Champaign), Jinjun Xiong (IBM Thomas J. Watson Research Center), Eiman Ebrahimi (NVIDIA), Wen-Mei Hwu (University of Illinois at Urbana-Champaign) Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between entities being analyzed. Practical graphs come in huge sizes, offer massive parallelism, and are stored in sparse-matrix formats such as compressed sparse row (CSR). To exploit the massive parallelism, developers are increasingly interested in using GPUs for graph traversal. However, due to their sizes, graphs often do not fit into the GPU memory. Prior works have either used input data preprocessing/partitioning or unified virtual memory (UVM) to migrate chunks of data from the host memory to the GPU memory. However, the large, multi-dimensional, and sparse nature of graph data presents a major challenge to these schemes and results in significant amplification of data movement and reduced effective data throughput. In this work, we propose EMOGI, an alternative approach to traverse graphs that do not fit in GPU memory using direct cache-line-sized access to data stored in host memory.
This paper addresses the open question of whether a sufficiently large number of overlapping cache-line-sized accesses can be sustained to 1) tolerate the long latency to host memory, 2) fully utilize the available bandwidth, and 3) achieve favorable execution performance. We analyze the data access patterns of several graph traversal applications in GPU over PCIe using an FPGA to understand the cause of poor external bandwidth utilization. By carefully coalescing and aligning external memory requests, we show that we can minimize the number of PCIe transactions and nearly fully utilize the PCIe bandwidth with direct cache-line accesses to the host memory. EMOGI achieves 2.60X speedup on average compared to the optimized UVM implementations in various graph traversal applications. We also show that EMOGI scales better than a UVM-based solution when the system uses higher bandwidth interconnects such as PCIe 4.0.

Efficiently Answering Reachability and Path Queries on Temporal Bipartite Graphs [Download Paper] Xiaoshuang Chen (University of New South Wales), Kai Wang (University of New South Wales), Xuemin Lin (University of New South Wales), Wenjie Zhang (University of New South Wales), Lu Qin (UTS), Ying Zhang (University of Technology Sydney) Bipartite graphs are naturally used to model relationships between two different types of entities, such as people-location, author-paper, and customer-product. When modeling real-world applications like disease outbreaks, edges are often enriched with temporal information, leading to temporal bipartite graphs. While reachability has been extensively studied on (temporal) unipartite graphs, it remains largely unexplored on temporal bipartite graphs. To fill this research gap, in this paper, we study the reachability problem on temporal bipartite graphs. Specifically, a vertex u reaches a vertex w in a temporal bipartite graph G if u and w are connected through a series of consecutive wedges with time constraints. Towards efficiently answering if a vertex can reach the other vertex, we propose an index-based method by adapting the idea of 2-hop labeling. Effective optimization strategies and parallelization techniques are devised to accelerate the index construction process. To better support real-life scenarios, we further show how the index is leveraged to efficiently answer other types of queries, e.g., single-source reachability query and earliest-arrival path query. Extensive experiments on 16 real-world graphs demonstrate the effectiveness and efficiency of our proposed techniques.

Efficient Bi-triangle Counting for Large Bipartite Networks [Download Paper] Yixing Yang (University of New South Wales), Yixiang Fang (School of Data Science, The Chinese University of Hong Kong, Shenzhen), Maria Orlowska (Polish-Japanese Institute of Information Technology), Wenjie Zhang (University of New South Wales), Xuemin Lin (University of New South Wales) A bipartite network is a network with two disjoint vertex sets and its edges only exist between vertices from different sets. It has received much interest since it can be used to model the relationship between two different sets of objects in many applications (e.g., the relationship between users and items in E-commerce). In this paper, we study the problem of efficient bi-triangle counting for a large bipartite network, where a bi-triangle is a cycle with three vertices from one vertex set and three vertices from another vertex set. Counting bi-triangles has found many real applications such as computing the transitivity coefficient and clustering coefficient for bipartite networks. To enable efficient bi-triangle counting, we first develop a baseline algorithm relying on the observation that each bi-triangle can be considered as the join of three wedges. Then, we propose a more sophisticated algorithm which regards a bi-triangle as the join of two super-wedges, where a wedge is a path with two edges while a super-wedge is a path with three edges. We further optimize the algorithm by ranking vertices according to their degrees. We have performed extensive experiments on both real and synthetic bipartite networks, where the largest one contains more than one billion edges, and the results show that the proposed solutions are up to five orders of magnitude faster than the baseline method.

13:30 – 14:30 CESTResearch Session 28: Data Streams II Chaired by Tyler Akidau

Real-Time Distance-Based Outlier Detection in Data Streams [Download Paper] Luan V Tran (University of Southern California), Min Mun (University of Southern California), Cyrus Shahabi (Computer Science Department. University of Southern California) Real-time outlier detection in data streams has drawn much attention recently as many applications need to be able to detect abnormal behaviors as soon as they occur. The arrival and departure of streaming data on edge devices impose new challenges to process the data quickly in real-time due to memory and CPU limitations of these devices. Existing methods are slow and not memory efficient as they mostly focus on quick detection of inliers and pay less attention to expediting neighbor searches for outlier candidates. In this study, we propose a new algorithm, CPOD, to improve the efficiency of outlier detections while reducing its memory requirements. CPOD uses a unique data structure called "core point" with multi-distance indexing to both quickly identify inliers and reduce neighbor search spaces for outlier candidates. We show that with six real-world and one synthetic dataset, CPOD is, on average, 10, 19, and 73 times faster than M_MCOD, NETS, and MCOD, respectively, while consuming low memory.

SAND: Streaming Subsequence Anomaly Detection [Download Paper] Paul Boniol (Université de Paris), John Paparrizos (University of Chicago), Themis Palpanas (University of Paris), Michael Franklin (University of Chicago) Subsequence anomaly detection in long sequences is an important problem with applications in a wide range of domains. With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In addition, subsequence anomaly detection methods usually require access to the entire dataset and are not able to learn and detect anomalies in streaming settings. To address these problems, we propose SAND, a novel online method suitable for domain-agnostic anomaly detection. The key idea of SAND is the detection of anomalies based on their (dis)similarity to a model that represents normal behavior. SAND relies on a novel steaming methodology to incrementally update such model, which adapts to distribution drifts and omits obsolete data. The experimental results on several real datasets demonstrate that SAND correctly identifies single and recurrent anomalies of various types, without prior knowledge of the characteristics of these anomalies. In addition, SAND outperforms by a large margin the current state-of-the-art algorithms in terms of accuracy, while achieving orders of magnitude speedups.

Randomized Error Removal for Online Spread Estimation in Data Streaming [Download Paper] Haibo Wang (University of Florida), Chaoyi Ma (University of Florida), Olufemi O Odegbile (University of Florida), Shigang Chen (University of Florida), Jih-Kwon Peir (University of Florida) Measuring flow spread in real time from large, high-rate data streams has numerous practical applications, where a data stream is modeled as a sequence of data items from different flows and the spread of a flow is the number of distinct items in the flow. Past decades have witnessed tremendous performance improvement for single-flow spread estimation. However, when dealing with numerous flows in a data stream, it remains a significant challenge to measure per-flow spread accurately while reducing memory footprint. The goal of this paper is to introduce new multi-flow spread estimation designs that incur much smaller processing overhead and query overhead than the state of the art, yet achieves significant accuracy improvement in spread estimation. We formally analyze the performance of these new designs. We implement them in both hardware and software, and use real-world data traces to evaluate their performance in comparison with the state of the art. The experimental results show that our best sketch significantly improves over the best existing work in terms of estimation accuracy, data item processing throughput, and online query throughput.

13:30 – 14:30 CESTResearch Session 29: Persistent Memory II Chaired by Alberto Lerner

Understanding the Idiosyncrasies of Real Persistent Memory [Download Paper] Shashank Gugnani (The Ohio State University), Arjun Kashyap (The Ohio State University), Xiaoyi Lu (The Ohio State University) High capacity persistent memory (PMEM) is finally commercially available in the form of Intel's Optane DC Persistent Memory Module (DCPMM). Researchers have raced to evaluate and understand the performance of DCPMM itself as well as systems and applications designed to leverage PMEM resulting from over a decade of research. Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are related to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, memristor, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM; i.e., how do the peculiarities of DCPMM related to other classes of PMEM. Clearly, there is a need for a study which can guide the design of systems and is agnostic to PMEM technology and internal architecture.
In this paper, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.

Persistent Memory Hash Indexes: An Experimental Evaluation [Download Paper] Daokun Hu (College of Computer Science and Electronic Engineering, Hunan University, China), Zhiwen Chen (Hunan University), Jianbing Wu (Peking University Shenzhen Graduate School), Jianhua Sun (College of Computer Science and Electronic Engineering, Hunan University, China), Hao Chen (College of Computer Science and Electronic Engineering, Hunan University, China) Persistent memory (PM) is increasingly being leveraged to build hash-based indexing structures featuring cheap persistence, high performance, and instant recovery, especially with the recent release of Intel Optane DC Persistent Memory Modules. However, most of them are evaluated on DRAM-based emulators with unreal assumptions, or focus on the evaluation of specific metrics with important properties sidestepped. Thus, it is essential to understand how well the proposed hash indexes perform on real PM and how they differentiate from each other if a wider range of performance metrics are considered.
To this end, this paper provides a comprehensive evaluation of persistent hash tables. In particular, we focus on the evaluation of six state-of-the-art hash tables including Level hashing, CCEH, Dash, PCLHT, Clevel, and SOFT, with real PM hardware. Our evaluation was conducted using a unified benchmarking framework and representative workloads. Besides characterizing common performance properties, we also explore how hardware configurations (such as PM bandwidth, CPU instructions, and NUMA) affect the performance of PM-based hash tables. With our in-depth analysis, we identify design trade-offs and good paradigms in prior arts, and suggest desirable optimizations and directions for the future development of PM-based hash tables.

Viper: An Efficient Hybrid PMem-DRAM Key-Value Store [Download Paper] Lawrence Benson (Hasso Plattner Institute, University of Potsdam), Hendrik Makait (Hasso Plattner Institute, University of Potsdam), Tilmann Rabl (HPI, University of Potsdam) Key-value stores (KVSs) have found wide application in modern software systems. For persistence, their data resides in slow secondary storage, which requires KVSs to employ various techniques to increase their read and write performance from and to the underlying medium. Emerging persistent memory (PMem) technologies offer data persistence at close-to-DRAM speed, making them a promising alternative to classical disk-based storage. However, simply drop-in replacing existing storage with PMem does not yield good results, as block-based access behaves differently in PMem than on disk and ignores PMem's byte addressability, layout, and unique performance characteristics. In this paper, we propose three PMem-specific access patterns and implement them in a hybrid PMem-DRAM KVS called Viper. We employ a DRAM-based hash index and a PMem-aware storage layout to utilize the random-write speed of DRAM and efficient sequential-write performance PMem. Our evaluation shows that Viper significantly outperforms existing KVSs for core KVS operations while providing full data persistence. Moreover, Viper outperforms existing PMem-only, hybrid, and disk-based KVSs by 4--18x for write workloads, while matching or surpassing their get performance.

13:30 – 14:30 CESTResearch Session 30: AutoML Chaired by Wolfgang Lehner

VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition [Download Paper] Yang Li (Peking University), Yu Shen (Peking University), Wentao Zhang (Peking University), Jiawei Jiang (ETH Zurich), Yaliang Li (Alibaba Group), Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group), Jingren Zhou (Alibaba Group), Zhi Yang (Peking University), Wentao Wu (Microsoft Research), Ce Zhang (ETH), Bin Cui (Peking University) End-to-end AutoML has attracted intensive interests from both academia and industry, which automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning. Existing AutoML systems, however, suffer from scalability issues when applying to application domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to utilize these building blocks to compose an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model -- akin to the one supported by modern database systems -- to execute the plan constructed. Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies that are significantly more efficient than the ones employed by state-of-the-art AutoML systems such as auto-sklearn.

How to Design Robust Algorithms using Noisy Comparison Oracle [Download Paper] Raghavendra Addanki (University of Massachusetts Amherst), Sainyam Galhotra (University of Massachusetts Amherst), Barna Saha (University of California, Berkeley) Metric based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as $k$-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak supervision in the form of a comparison oracle that compares the relative distance between the queried points such as `Is point $u$ closer to $v$ or $w$ closer to $x$?'. However, it is possible that some queries are easier to answer than others using a comparison oracle. We capture this by introducing two different noise models called adversarial and probabilistic noise. In this paper, we study various problems that include finding maximum, nearest/farthest neighbor search under these noise models. Building upon the techniques we develop for these problems, we give robust algorithms for $k$-center clustering and agglomerative hierarchical clustering. We prove that our algorithms achieve good approximation guarantees with a high probability and analyze their query complexity. We evaluate the effectiveness and efficiency of our techniques empirically on various real-world datasets.

Doing More with Less: Characterizing Dataset Downsampling for AutoML [Download Paper] Fatjon Zogaj (ETH Zurich), Jose P Cambronero Sanchez (MIT), Martin Rinard (MIT), Jürgen Cito (TU Wien and MIT) Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric, subject to a limited execution time budget. While this approach to generating candidates works well for small datasets, the same procedure does not directly scale to larger datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. We carry out an extensive empirical evaluation of the impact that downsampling -- reducing the number of rows in the input tabular dataset -- has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.

13:30 – 14:30 CESTResearch Session 31: Implementing DBMS Chaired by Ippokratis Pandis

Scaling Replicated State Machines with Compartmentalization [Download Paper] Michael J Whittaker (UC Berkeley), Ailidani Ailidani (Microsoft), Aleksey Charapko (University of New Hampshire), Murat Demirbas (University at Buffalo, SUNY), Neil Giridharan (UC Berkeley), Joseph M Hellerstein (UC Berkeley), Heidi Howard (University of Cambridge), Ion Stoica (UC Berkeley), Adriana Szekeres (VMware) State machine replication protocols, like MultiPaxos and Raft, are a critical component of many distributed systems and databases. However, these protocols offer relatively low throughput due to several bottlenecked components. Numerous existing protocols fix different bottlenecks in isolation but fall short of a complete solution. When you fix one bottleneck, another arises. In this paper, we introduce compartmentalization, the first comprehensive technique to eliminate state machine replication bottlenecks. Compartmentalization involves decoupling individual bottlenecks into distinct components and scaling these components independently. Compartmentalization has two key strengths. First, compartmentalization leads to strong performance. In this paper, we demonstrate how to compartmentalize MultiPaxos to increase its throughput by 6x on a write-only workload and 16x on a mixed read-write workload. Unlike other approaches, we achieve this performance without the need for specialized hardware. Second, compartmentalization is a technique, not a protocol. Industry practitioners can apply compartmentalization to their protocols incrementally without having to adopt a completely new protocol.

Hindsight Logging for Model Training [Download Paper] Rolando Garcia (UC Berkeley), Erick Liu (UC Berkeley), Vikram Sreekanti (UC Berkeley), Bobby Yan (UC Berkeley), Anusha Dandamudi (UC Berkeley), Joseph Gonzalez (UC Berkeley), Joseph M Hellerstein (UC Berkeley), Koushik Sen (University of California, Berkeley) In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint---a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g. 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.

In-Network Support for Transaction Triaging [Download Paper] Theo Jepsen (Stanford University), Alberto Lerner (University of Friborug), Fernando Pedone (Università della Svizzera italiana), Robert Soule (Yale University), Philippe Cudre-Mauroux (Exascale Infolab, Fribourg University) We introduce Transaction Triaging, a set of techniques that manipulate streams of transaction requests and responses while they travel to and from a database server. Compared to normal transaction streams, the triaged ones execute faster once they reach the database. The triaging algorithms do not interfere with the transaction execution nor require adherence to any particular concurrency control method, making them easy to port across database systems.
Transaction Triaging leverages recent programmable networking hardware that can perform computations on in-flight data. We evaluate our techniques on an in-memory database system using an actual programmable hardware network switch. Our experimental results show that triaging brings enough performance gains to compensate for almost all networking overheads. In high-overhead network stacks such as UDP/IP, we see throughput improvements from 2.05× to 7.95×. In an RDMA stack, the gains range from 1.08× to 1.90× without introducing significant latency.

13:30 – 14:30 CESTResearch Session 32: Provenance Chaired by Martin Hentschel

Compact, Tamper-Resistant Archival of Fine-Grained Provenance [Download Paper] Nan Zheng (University of Pennsylvania), Zack Ives (University of Pennsylvania) Data provenance tools aim to facilitate reproducible data science and auditable data analyses, by tracking the processes and inputs responsible for each result of an analysis. Fine-grained provenance further enables sophisticated reasoning about why individual output results appear or fail to appear — aiding debugging and diagnosis. However, for provenance to be truly useful for reproducibility and auditing, we need a provenance archival system that ensures it is tamper-resistant, and that storing provenance collected over many queries and across time is efficient (i.e., it compresses repeated results). In this paper we study this problem, developing solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using a variety of workloads based on scientific and OLAP workloads.

Capturing and querying fine-grained provenance of preprocessing pipelines in data science [Download Paper] Adriane Chapman (University of Southampton), Paolo Missier (Newcastle University), Giulia Simonelli (Universita Roma Tre), Riccardo Torlone (Universita Roma Tre) Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models, have an impact on those models' accuracy and performance, as well on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TCP-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers' debugging questions, as expressed on the Data Science Stack Exchange.

Fine-Grained Lineage for Safer Notebook Interactions [Download Paper] Stephen Macke (University of Illinois at Urbana-Champaign), Aditya Parameswaran (University of California, Berkeley), Hongpu Gong (University of California, Berkeley), Doris Lee (UC Berkeley), Doris Xin (UC Berkeley), Andrew Head (University of California, Berkeley) Computational notebooks have emerged as the platform of choice for data science and analytical workflows, enabling rapid iteration and exploration. By keeping intermediate program state in memory and segmenting units of execution into so-called "cells", notebooks allow users to execute their workflows interactively and enjoy particularly tight feedback. However, as cells are added, removed, reordered, and rerun, this hidden intermediate state accumulates in a way that is not necessarily correlated with the code visible in the notebook's cells, making execution behavior difficult to reason about, and leading to errors and lack of reproducibility. We present NBSafety, a custom Jupyter kernel that uses runtime tracing and static analysis to automatically manage lineage associated with cell execution and global notebook state. NBSafety detects and prevents errors that users make during unaided notebook interactions, all while preserving the flexibility of existing notebook semantics. We evaluate NBSafety's ability to prevent erroneous interactions by replaying and analyzing 666 real notebook sessions. Of these, NBSafety identified 117 sessions with potential safety errors, and in the remaining 549 sessions, the cells that NBSafety identified as resolving safety issues were more than 7x more likely to be selected by users for re-execution compared to a random baseline, even though the users were not using NBSafety and were therefore not influenced by its suggestions.

13:30 – 14:30 CESTResearch Session 33: Database Applications Chaired by Jorge Anulfo Quiane Ruiz

On the String Matching with k Differences in DNA Databases [Download Paper] Yangjun Chen (University of Winnipeg), Hoang Hai Nguyen (University of Winnipeg) In this paper, we discuss an efficient and effective index mechanism for the string matching with \textit{k} differences, by which we will find all the substrings of a target string $y$ of length $n$ that align with a pattern string $x$ of length $m$ with not more than $k$ insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, $n$ is often on the order of millions or billions while $m$ is just a hundred or a thousand. The main idea of our method is to transform \textit{y} to a BWT-array as an index, denoted as \textit{BWT}(\textit{y}), and search \textit{x} against it. The time complexity of our method is bounded by O($k\cdot|T|$), where $T$ is a tree structure dynamically generated during a search of \textit{BWT}(\textit{y}). The average value of $|T|$ is bounded by O($|\Sigma|^2k}$), where $\Sigma$ is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies for the cases of $k$ $\leq$ O(log$_{|\Sigma|}n$). The general working process consists of two steps. In the first step, $x$ is decomposed into a series of $l$ small subpatterns, and \textit{BWT}($y$) is utilized to speed up the process to figure out all the occurrences of such subpatterns with $\lfloor$$k$/$l$$\rfloor$ differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match $x$, but with $k$ differences. Extensive experiments have been conducted, which show that our method for this problem is promising.

A four-dimensional Analysis of Partitioned Approximate Filters [Download Paper] Tobias Schmidt (TUM), Maximilian Bandle (TUM), Jana Giceva (TU Munich) With today’s data deluge, approximate filters are particularly attractive to avoid expensive operations like remote data/disk accesses. Among the many filter variants available, it is non-trivial to find the most suitable one and its optimal configuration for a specific use-case. We provide open-source implementations for the most relevant filters (Bloom, Cuckoo, Morton, and Xor filters) and compare them in four key dimensions: the false-positive rate, space consumption, build, and lookup throughput.
We improve upon existing state-of-the-art implementations with a new optimization, radix-partitioning, boosting the build and lookup throughput for large filters by up to 9x and 5x. Our in-depth evaluation first studies the impact of all available optimizations separately before combining them to search for the optimal filter for specific use-cases. While register-blocked Bloom filters offer the highest throughput, the new Xor filters are best suited when optimizing for small filter sizes or low false-positive rates.

Unconstrained Submodular Maximization with Modular Costs: Tight Approximation and Application to Profit Maximization [Download Paper] Tianyuan Jin (National University of Singapore), Yu Yang (City University of Hong Kong), Renchi Yang (National University of Singapore), Jieming Shi (The Hong Kong Polytechnic University), Keke Huang (Nanyang Technological University), Xiaokui Xiao (National University of Singapore) Given a set V, the problem of unconstrained submodular maximization with modular costs (USM-MC) asks for a subset S \subseteq V that maximizes f(S) - c(S), where f is a non-negative, monotone, and submodular function that gauges the utility of S, and c is a non-negative and modular function that measures the cost of S. This problem finds applications in numerous practical scenarios, such as profit maximization in viral marketing on social networks.
This paper presents ROI-Greedy, a polynomial time algorithm for USM-MC that returns a solution S satisfying f(S) - c(S) >= f(S*) - c(S*) - c(S*)\ln(f(S*)/c(S*)), where S* is the optimal solution to USM-MC. To our knowledge, ROI-Greedy is the first algorithm that provides such a strong approximation guarantee. In addition, We show that this worst-case guarantee is tight, in the sense that no polynomial time algorithm can ensure f(S) - c(S) >= (1+\epsilon)(f(S*) - c(S*) - c(S*) \ln(f(S*)/c(S*)), for any \epsilon > 0. Further, we devise a non-trivial extension of ROI-Greedy to solve the profit maximization problem, where the precise value of f(S) for any set S is unknown and can only be approximated via sampling. Extensive experiments on benchmark datasets demonstrate that ROI-Greedy significantly outperforms competing methods in terms of the trade-off between efficiency and solution quality.

13:30 – 14:30 CESTResearch Session 34: Fairness & Explainability Chaired by Nargesian Fatemeh

Tailoring Data Source Distributions for Fairness-aware Data Integration [Download Paper] Fatemeh Nargesian (University of Rochester), Abolfazl Asudeh (University of Illinois at Chicago), H. V. Jagadish (University of Michigan) Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: in other words, that it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required.
In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal cost. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.

Automated Feature Engineering for Algorithmic Fairness [Download Paper] Ricardo Salazar (TU Berlin), Felix Neutatz (TU Berlin), Ziawasch Abedjan (Leibniz Universität Hannover) One of the fundamental problems of machine ethics is to avoid the perpetuation and amplification of discrimination through machine learning applications. In particular, it is desired to exclude the influence of attributes with sensitive information, such as gender or race, and other causally related attributes on the machine learning task. The state-of-the-art bias reduction algorithm Capuchin breaks the causality chain of such attributes by adding and removing tuples. However, this horizontal approach can be considered invasive because it changes the data distribution. A vertical approach would be to prune sensitive features entirely. While this would ensure fairness without tampering with the data, it could also hurt the machine learning accuracy. Therefore, we propose a novel multi-objective feature selection strategy that leverages feature construction to generate more features that lead to both high accuracy and fairness. On three well-known datasets, our system achieves higher accuracy than other fairness-aware approaches while maintaining similar or higher fairness.

Explaining Monotonic Ranking Functions [Download Paper] Abraham Gale (Rutgers University), Amelie Marian (Rutgers) Ranking functions are commonly used to assist in decision-making in a wide variety of applications. As the general public realizes the significant societal impacts of the widespread use of algorithms in decisions, there has been a push towards explainability and transparency in decision processes and results, as well as demands to justify the fairness of the processes. In this paper, we focus on providing metrics towards explainability and transparency of ranking functions, with a focus towards making the the ranking process understandable, {\em a priori}, so that decision-makers can make informed choices when designing their ranking selection process. We propose transparent participation metrics to clarify the ranking process, by assessing the contribution of each parameter used in the ranking function in the creation of the final ranked outcome, using information about the ranking functions themselves, as well as observations of the underlying distributions of the parameter values involved in the ranking. To evaluate the outcome of the ranking process, we propose diversity and disparity metrics to measure how similar the selected objects are to each other, and to the underlying data distribution. We evaluate the behavior of our metrics on synthetic data, as well as on data and ranking functions on two real-world scenarios: high school admissions and decathlon scoring.

16:15 – 17:45 CESTResearch Session 35: Differential Privacy I Chaired by Wolfgang Lehner

Optimizing Fitness-For-Use of Differentially Private Linear Queries [Download Paper] Yingtai Xiao (Pennsylvania State University), Zeyu Ding (Penn State), Yuxin Wang (Penn State), Danfeng Zhang (Penn State), Daniel Kifer (Penn State) In practice, differentially private data releases are designed to support a variety of applications. A data release is fit for use if it meets target accuracy requirements for each application. In this paper, we consider the problem of answering linear queries under differential privacy subject to per-query accuracy constraints. Existing practical frameworks like the matrix mechanism do not provide such fine-grained control (they optimize total error, which allows some query answers to be more accurate than necessary, at the expense of other queries that become no longer useful). Thus, we design a fitness-for-use strategy that adds privacy-preserving Gaussian noise to query answers. The covariance structure of the noise is optimized to meet the fine-grained accuracy requirements while minimizing the cost to privacy.

Data Synthesis via Differentially Private Markov Random Field [Download Paper] Kuntai Cai (National University of Singapore), Xiaoyu Lei (University of Connecticut), Jianxin Wei (National Univ. of Singapore), Xiaokui Xiao (National University of Singapore) This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D, and then use them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limits the choices of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality the resulting synthetic data. To address the above deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choices of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of count queries and classification tasks conducted on the synthetic data generated.

Differentially Private Binary- and Matrix-Valued Data Query: An XOR Mechanism [Download Paper] Tianxi Ji (Case Western Reserve University), Pan Li (Case Western Reserve University), Emre Yilmaz (University of Houston-Downtown), Erman Ayday (Case Western Reserve University, Bilkent University), Yanfang Ye (Case Western Reserve University), Jinyuan Sun (The University of Tennessee, Knoxville) Differential privacy has been widely adopted to release continuous- and scalar-valued information on a database without compromising the privacy of individual data records in it. The problem of querying binary- and matrix-valued information on a database in a differentially private manner has rarely been studied. However, binary- and matrix-valued data are ubiquitous in real-world applications, whose privacy concerns may arise under a variety of circumstances. In this paper, we devise an exclusive or (XOR) mechanism that perturbs binary- and matrix-valued query result by conducting an XOR operation on the query result with calibrated noises attributed to a matrix-valued Bernoulli distribution. We first rigorously analyze the privacy and utility guarantee of the proposed XOR mechanism. Then, to generate the parameters in the matrix-valued Bernoulli distribution, we develop a heuristic approach to minimize the expected square query error rate under 𝜖-differential privacy constraint. Additionally, to address the intractability of calculating the probability density function (PDF) of this distribution and efficiently generate samples from it, we adapt an Exact Hamiltonian Monte Carlo based sampling scheme. Finally, we experimentally demonstrate the efficacy of the XOR mechanism by considering binary data classification and social network analysis, all in a differentially private manner. Experiment results show that the XOR mechanism notably outperforms other state-of-the-art differentially private methods in terms of utility (such as classification accuracy and 𝐹1 score), and even achieves comparable utility to the non-private mechanisms.

Answering Multi-Dimensional Range Queries under Local Differential Privacy [Download Paper] Jianyu Yang (Beijing University of Posts and Telecommunications), Tianhao Wang (Purdue University), Ninghui Li (Purdue University), Xiang Cheng (Beijing University of Posts and Telecommunications), Sen Su (Beijing University of Posts and Telecommunications) In this paper, we tackle the problem of answering multi-dimensional range queries under local differential privacy. There are three key technical challenges: capturing the correlations among attributes, avoiding the curse of dimensionality, and dealing with the large domains of attributes. None of the existing approaches satisfactorily deals with all three challenges. Overcoming these three challenges, we first propose an approach called Two-Dimensional Grids (TDG). Its main idea is to carefully use binning to partition the two-dimensional (2-D) domains of all attribute pairs into 2-D grids that can answer all 2-D range queries and then estimate the answer of a higher dimensional range query from the answers of the associated 2-D range queries. However, in order to reduce errors due to noises, coarse granularities are needed for each attribute in 2-D grids, losing fine-grained distribution information for individual attributes. To correct this deficiency, we further propose Hybrid-Dimensional Grids (HDG), which also introduces 1-D grids to capture finer-grained information on distribution of each individual attribute and combines information from 1-D and 2-D grids to answer range queries. To make HDG consistently effective, we provide a guideline for properly choosing granularities of grids based on an analysis of how different sources of errors are impacted by these choices. Extensive experiments conducted on real and synthetic datasets show that HDG can give a significant improvement over the existing approaches.

CGM: An Enhanced Mechanism for Streaming Data Collectionwith Local Differential Privacy [Download Paper] ergute bao (national university of singapore), Yin Yang (Hamad bin Khalifa University), Xiaokui Xiao (National University of Singapore), Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group) Local differential privacy (LDP) is a well-established privacy protection scheme for collecting sensitive data, which has been integrated into major platforms such as iOS, Chrome, and Windows. The main idea is that each individual randomly perturbs her data on her local device, and only uploads the noisy version to an untrusted data aggregator. This paper focuses on the collection of streaming data consisting of regular updates, e.g., daily app usage. Such streams, when aggregated over a large population, often exhibit strong autocorrelations, e.g., the average usage of an app usually does not change dramatically from one day to the next. To our knowledge, this property has been largely neglected in existing LDP mechanisms. Consequently, data collected with current LDP methods often exhibit unrealistically violent fluctuations due to the added noise, drowning the overall trend, as shown in our experiments.
This paper proposes a novel correlated Gaussian mechanism (CGM) for enforcing (epsilon, delta)-LDP on streaming data collection, which reduces noise by exploiting public-known autocorrelation patterns of the aggregated data. This is done through non-trivial modifications to the core of the underlying Gaussian Mechanism; in particular, CGM injects temporally correlated noise, computed through an optimization program that takes into account the given autocorrelation pattern, data value range, and utility metric. CGM comes with formal proof of correctness, and consumes negligible computational resources. Extensive experiments using real datasets from different application domains demonstrate that CGM achieves consistent and significant utility gains compared to the baseline method of repeatedly running the underlying one-shot LDP mechanism.

16:15 – 17:45 CESTResearch Session 36: Graph Management II Chaired by Riccardo Tommasini

Minimum Vertex Augmentation [Download Paper] Jianwen Zhao (Chinese University of Hong Kong), Yufei Tao (The Chinese University of Hong Kong) This paper introduces a class of graph problems named minimum vertex augmentation (MVA). Given an input graph $G$ where each vertex carries a binary color 0 or 1, we want to flip the colors of the fewest 0-vertices such that the subgraph induced by all the (original and new) 1-vertices satisfies a user-defined predicate $\pi$. In other words, the goal is to minimally augment the subset of 1-vertices to uphold the property pi. Different formulations of $\pi$ instantiate the framework into concrete problems at the core of numerous applications. We first describe a suite of techniques for solving MVA problems with strong performance guarantees, and then present a generic algorithmic paradigm that a user can instantiate to deal with ad-hoc MVA problems. The effectiveness and efficiency of our solutions are verified with an extensive experimental evaluation.

Automating Incremental Graph Processing with Flexible Memoization [Download Paper] Shufeng Gong (NorthEastern University), Chao Tian (Alibaba Grioup), Qiang Yin (Alibaba Group), Wenyuan Yu (Alibaba Group), Yanfeng Zhang (NorthEastern University), Liang Geng (Alibaba Group), Song Yu (NorthEastern University), Ge Yu (Northeast University), Jingren Zhou (Alibaba Group) The ever-growing amount of dynamic graph data demands efficient techniques of incremental graph processing. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or choose different memoization strategies for certain specific types of computation, limiting the usability and generality.
In light of these challenges, we propose Ingress, an automated system for incremental graph processing. Ingress is able to incrementalize batch vertex-centric algorithms into their incremental counterparts as a whole, without the need of redesigned logic or data structures from users. Underlying Ingress is an automated incrementalization framework equipped with four different memoization policies, aiming to support all kinds of graph computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies. Ingress chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to the ease-of-use and generalization, Ingress outperforms state-of-the-art incremental graph systems by 15.93× on average (up to 147.14×) in efficiency.

RapidMatch: A Holistic Approach to Subgraph Query Processing [Download Paper] Shixuan Sun (National University of Singapore), Xibo Sun (Hong Kong University of Science and Technology), Yulin Che (Hong Kong University of Science and Technology), Qiong Luo (Hong Kong University of Science and Technology), Bingsheng He (National University of Singapore) A subgraph query searches for all embeddings in a data graph that are identical to a query graph. Two kinds of algorithms, either graph exploration based or join based, have been developed for processing subgraph queries. Due to algorithmic and implementational differences, join-based systems can handle query graphs of a few vertices efficiently whereas exploration-based approaches typically process up to several tens of vertices in the query graph. In this paper, we first compare these two kinds of methods and prove that the complexity of result enumeration in state-of-the-art exploration-based methods matches that of the worst-case optimal join. Furthermore, we propose RapidMatch, a holistic subgraph query processing framework integrating the two approaches. Specifically, RapidMatch not only runs relational operators such as selections and joins, but also utilizes graph structural information, as in graph exploration, for filtering and join plan generation. Consequently, it outperforms the state of the art in both approaches on a wide range of query workloads.

Teseo and the Analysis of Structural Dynamic Graphs [Download Paper] Dean De Leo (Centrum Wiskunde & Informatica), Peter Boncz (CWI) We present Teseo, a new system for the storage and analysis of dynamic structural graphs in main-memory and the addition of transactional support. Teseo introduces a novel design based on sparse arrays, large arrays interleaved with gaps, and a fat tree, where the graph is ultimately stored. Our design contrasts with early systems for the analysis of dynamic graphs, which often lack transactional support and are anchored to a vertex table as a primary index. We claim that the vertex table implies several constraints, often neglected, that can actually impair the generality, the robustness and extension opportunities of these systems. We compare Teseo with other dynamic graph systems, showing a high resilience to workload and input changes, while achieving comparable, if not superior, throughputs in updates and latencies in raw scans.

ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms [Download Paper] Laxman Dhulipala (MIT CSAIL), Changwan Hong (Massachusetts Institute of Technology), Julian Shun (MIT) Connected components is a fundamental kernel in graph applications. The fastest existing multicore algorithms for graph connectivity are based on some form of edge sampling and/or linking and compressing trees. However, many combinations of these design choices have been left unexplored. In this paper, we design the ConnectIt framework, which provides different sampling strategies as well as various tree linking and compression schemes. ConnectIt enables us to obtain several hundred new variants of connectivity algorithms, most of which extend to computing spanning forest. In addition to static graphs, we also extend ConnectIt to support mixes of insertions and connectivity queries in the concurrent setting.
We present an experimental evaluation of ConnectIt on a 72-core machine, which we believe is the most comprehensive evaluation of parallel connectivity algorithms to date. Compared to a collection of state-of-the-art static multicore algorithms, we obtain an average speedup of 12.4x (2.36x average speedup over the fastest existing implementation for each graph). Using ConnectIt, we are able to compute connectivity on the largest publicly-available graph (with over 3.5 billion vertices and 128 billion edges) in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the fastest existing connectivity result for this graph, in any computational setting. For our incremental algorithms, we show that our algorithms can ingest graph updates at up to several billion edges per second. To guide the user in selecting the best variants in ConnectIt for different situations, we provide a detailed analysis of the different strategies. Finally, we show how the techniques in ConnectIt can be used to speed up two important graph applications: approximate minimum spanning forest and SCAN clustering.

19Aug

09:00 – 10:30 CESTResearch Session 37: Timeseries Chaired by Steffen Zeuch

Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series [Download Paper] Vincent Jacob (Ecole Polytechnique), Fei Song (Ecole Polytechnique), Arnaud Stiegler (Ecole Polytechnique), Bijan Rad (Ecole Polytechnique), Yanlei Diao (Ecole Polytechnique), Nesime Tatbul (Intel Labs and MIT) Access to high-quality data repositories and benchmarks have been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been gaining lots of attention, lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as those for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon’s dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.

ORBITS: Online Recovery of Missing Values in Multiple Time Series Streams [Download Paper] Mourad Khayati (University of Fribourg), Ines Arous (University of Fribourg), Zakhar Tymchenko (University of Fribourg), Philippe Cudre-Mauroux (Exascale Infolab, Fribourg University) With the emergence of the Internet of Things (IoT), time series streams have become ubiquitous in our daily life. Recording such data is rarely a perfect process, as sensor failures frequently occur, yielding occasional blocks of data that go missing in multiple time series. These missing blocks do not only affect real-time monitoring but also compromise the quality of online data analyses. Effective streaming recovery (imputation) techniques either have a quadratic runtime complexity, which is infeasible for any moderately sized data, or cannot recover more than one time series at a time. In this paper, we introduce a new online recovery technique to recover multiple time series streams in linear time. Our recovery technique implements a novel incremental version of the Centroid Decomposition technique and reduces its complexity from quadratic to linear. Using this incremental technique, missing blocks are efficiently recovered in a continuous manner based on previous recoveries. We formally prove the correctness of our new incremental computation, which yields an accurate recovery. Our experimental results on real-world time series show that our recovery technique is, on average, 30% more accurate than the state of the art while being vastly more efficient.

Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries [Download Paper] Zhiqi Wang (The Chinese University of HK), Jin Xue (The Chinese University of HK), Zili Shao (The Chinese University of Hong Kong) Performance-monitoring timeseries systems such as Prometheus and InfluxDB play a critical role in assuring reliability and operationality. These systems commonly adopt a column-oriented storage model, by which timeseries samples from different timeseries are separated, and all samples (with both numeric values and timestamps) in one timeseries are grouped into chunks and stored together. As a group of timeseries are often collected from the same source with the same timestamps, managing timestamps and metrics in a group manner provides more opportunities for query and insertion optimization but posts new challenges as well. Besides, for performance monitoring systems, to support better compression and efficient queries for most recent data that are most likely accessed by users, huge volumes of data are first cached in memory and then periodically flushed to disks. Periodical data flushing incurs high IO overhead, and simply discarding flushed data, which can still serve queries, not only is a waste but also brings huge memory reclamation cost. In this paper, we propose Heracles which integrates two techniques - (1) a new storage model, which enables efficient queries on compressed data by utilizing the shared timestamp column to easily locate corresponding metric values; (2) a novel two-level epoch-based memory manager, which allows the system to gradually flush and reclaim in-memory data while unreclaimed data can still serve queries. Heracles is implemented as a standalone module that can be easily integrated into existing performance monitoring timeseries systems. We have implemented a fully functional prototype with Heracles based on Prometheus tsdb, a representative open-source performance monitoring system, and conducted extensive experiments with real and synthetic timeseries data. Experimental results show that, compared with Prometheus, Heracles can improve the insertion throughput by 171%, and reduce the query latency and space usage by 32% and 30%, respectively, on average. Besides, to compare with other state-of-the-art storage techniques, we have integrated LevelDB (for LSM-tree-based structure) and Parquet (for column stores) into Prometheus tsdb, respectively, and experimental results show Heracles outperform these two integrations. We have released the open-source code of Heracles for public access.

FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data [Download Paper] YAN SHUYUAN (Alibaba), Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group), Wei Guo (Alibaba), Jingren Zhou (Alibaba Group), Zhewei Wei (Renmin University of China), Xiaowei Jiang (Alibaba Group), Sheng Xu (Alibaba Group) Interactive response time is important in analytical pipelines for users to explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the part of data to be focused on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. While there are a number of forecasting models available, the first step is the performance bottleneck. A natural idea is to utilize sampling to obtain approximate aggregations in real time as the input to train the forecasting model. Our scalable real-time forecasting system FlashP (Flash Prediction) is built based on this idea, with two major challenges to be resolved in this paper: first, we need to figure out how approximate aggregations affect the fitting of forecasting models, and forecasting results; and second, accordingly, what sampling algorithms we should use to obtain these approximate aggregations and how large the samples are. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations using GSW samples. We introduce how to construct compact GSW samples with the existence of multiple measures to be analyzed. We conduct experiments to evaluate our solution and compare it with alternatives on real data.

09:00 – 10:30 CESTResearch Session 38: Deep Learning II Chaired by Hazar Harmouch

DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks [Download Paper] Arif Usta (Bilkent University), Akifhan Karakayalı (Bilkent University), Özgür Ulusoy (Bilkent University) Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in database community recently. Conventional rule based systems utilize series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step overlooking the earlier steps by using ad-hoc solutions. In the pipeline, one of the most critical and challenging problems is keyword mapping; constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define the keyword mapping problem as a sequence tagging problem, and propose a novel deep learning based supervised approach that utilizes POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets, and report new state-of-the-art accuracy results, 92.4% on the average. Our results also indicate that DBTagger is faster than its counterparts up to 10000 times and scalable for bigger databases.

Analyzing and Mitigating Data Stalls in DNN Training [Download Paper] Jayashree Mohan (UT Austin), Amar Phanishayee (Microsoft Research), Ashish Raniwala (Microsoft), Vijay Chidambaram (UT Austin and VMWare) Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs), that typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation etc on servers that are a part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configs show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5x on a single server).

Progressive Compressed Records: Taking a Byte out of Deep Learning Data [Download Paper] Michael Kuchnik (Carnegie Mellon University), George Amvrosiadis (Carnegie Mellon University), Virginia Smith (Carnegie Mellon University) Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserve bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous storage formats by combining progressive compression with an efficient storage layout to view a single dataset at multiple fidelities---all without adding to the total dataset size. We implement PCRs and evaluate them on a range of datasets, training tasks, and hardware architectures. Our work shows that: (i) the amount of compression a dataset can tolerate exceeds 50\% of the original encoding for many training tasks; (ii) it is possible to automatically and efficiently select appropriate compression levels for a given task; and (iii) PCRs enable tasks to readily access compressed data at runtime---utilizing as little as half the training bandwidth and thus potentially doubling training speed.

Towards an Optimized GROUP BY Abstraction for Large-Scale Machine Learning [Download Paper] Side Li (University of California, San Diego), Arun Kumar (University of California, San Diego) Many applications that use large-scale machine learning (ML) increasingly prefer different models for subgroups (e.g., countries) to improve accuracy, fairness, or other desiderata. We call this emerging popular practice learning over groups, analogizing to GROUP BY in SQL, albeit for ML training instead of SQL aggregates. From the systems standpoint, this practice compounds the already data-intensive workload of ML model selection (e.g., hyperparameter tuning). Often, thousands of models may need to be trained, necessitating high-throughput parallel execution. Alas, most ML systems today focus on training one model at a time or at best, parallelizing hyperparameter tuning. This status quo leads to resource wastage, low throughput, and high runtimes. In this work, we take the first step towards enabling and optimizing learning over groups from the data systems standpoint for three popular classes of ML: linear models, neural networks, and gradient-boosted decision trees. Analytically and empirically, we compare standard approaches to execute this workload today: task-parallelism and data-parallelism. We find neither is universally dominant. We put forth a novel hybrid approach we call grouped learning that avoids redundancy in communications and I/O using a novel form of parallel gradient descent we call Gradient Accumulation Parallelism (GAP). We prototype our ideas into a system we call Kingpin built on top of existing ML tools and the flexible massively-parallel runtime Ray. An extensive empirical evaluation on large ML benchmark datasets shows that Kingpin matches or is 4x to 14x faster than state-of-the-art ML systems, including Ray's native execution and PyTorch DDP.

Dealer: An End-to-End Model Marketplace with Differential Privacy [Download Paper] Jinfei Liu (Emory University/Georgia Institute of Technology), Jian Lou (Emory University), Junxu Liu (Emory University), Li Xiong (Emory University), Jian Pei (Simon Fraser University), Jimeng Sun (CS) Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace perspective and propose the first en\textbf{\underline{D}}-to-end mod\textbf{\underline{e}}l m\textbf{\underline{a}}rketp\textbf{\underline{l}}ace with diff\textbf{\underline{e}}rential p\textbf{\underline{r}}ivacy (\emph{Dealer}) towards answering the following questions: \emph{How to formulate data owners' compensation functions and model buyers' price functions? How can the broker determine prices for a set of models to maximize the revenue with arbitrage-free guarantee, and train a set of models with maximum data coverage given a manufacturing budget to remain competitive}? For the former, we propose compensation function for each data owner based on Shapley value and privacy sensitivity, and price function for each model buyer based on data coverage sensitivity and noise sensitivity. Both privacy sensitivity and noise sensitivity are measured by the level of differential privacy. For the latter, we formulate two optimization problems for model pricing and model training, and propose efficient dynamic programming algorithms. Experiment results on the real breast cancer dataset and synthetic datasets justify the design of \emph{Dealer} and verify the efficiency and effectiveness of the proposed algorithms.

09:00 – 10:30 CESTResearch Session 39: Indices Chaired by Boris Novikov

Benchmarking Learned Indexes [Download Paper] Ryan C Marcus (MIT), Andreas Kipf (MIT), Alexander van Renen (TUM), Mihail Stoian (TUM), Sanchit Misra (Intel), Alfons Kemper (TUM), Thomas Neumann (TUM), Tim Kraska (MIT) Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads [Download Paper] Jialin Ding (MIT), Vikram Nathan (MIT), Mohammad Alizadeh (MIT CSAIL), Tim Kraska (MIT) Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.

Cuckoo Index: A Lightweight Secondary Index Structure [Download Paper] Andreas Kipf (MIT), Damian Chromejko (Google), Alexander Hall (RelationalAI), Peter Boncz (Centrum Wiskunde & Informatica), David G Andersen (Carnegie Mellon University) In modern data warehousing, data skipping is essential for high query performance. While index structures such as B-trees or hash tables allow for precise pruning, their large storage requirements make them impractical for indexing secondary columns. Therefore, many systems rely on ap-proximate indexes such as min/max sketches (ZoneMaps) or Bloom ﬁlters for cost-effective data pruning. For example, Google PowerDrill skips more than 90% of data on average using such indexes. In this paper, we introduce Cuckoo Index (CI), an approximate secondary index structure that represents the many-to-many relationship between keys and data partitions in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions. With our approach, we tar-get equality predicates in a read-only (immutable) setting and optimize for space efficiency under the premise of practical build and lookup performance. In contrast to per-partition (Bloom) filters, CI produces correct results for lookups with keys that occur in the data. CI allows to control the ratio of false positive partitions for lookups with non-occurring keys. Our experiments with real-world and synthetic data show that CI consumes significantly less space than per-partition ﬁlters for the same pruning power for low-to-medium cardinality columns. For high cardinality columns, CI is on par with its baselines.

Updatable Learned Index with Precise Positions [Download Paper] Jiacheng Wu (Tsinghua University), Yong Zhang ( Tsinghua University, China), Shimin Chen (Chinese Academy of Sciences), Yu Chen (Tsinghua University), Jin Wang (UCLA), Chunxiao Xing (Tsinghua University) Index plays an essential role in modern database engines to accelerate the query processing. The new paradigm of "learned index" has significantly changed the way of designing index structures in DBMS. The key insight is that indexes could be regarded as learned models that predict the position of a lookup key in the dataset. While such studies show promising results in both lookup time and index size, they cannot efficiently support update operations. Although recent studies have proposed some preliminary approaches to support update, they are at the cost of scarifying the lookup performance as they suffer from overheads brought by imprecise predictions in the leaf nodes. In this paper, we propose LIPP, a brand new framework of learned index to address such issues. Similar with state-of-the-art learned index structures, LIPP is able to support all kinds of index operations, namely lookup, insert, delete, update and bulkload. Meanwhile, we overcome the limitations of previous studies by properly extending the tree structure when dealing with update operations so as to eliminate the deviation of location predicted by models in the leaf nodes. Moreover, we further propose a dynamic adjustment strategy to ensure that the height of the tree index is tightly bounded and provide comprehensive theoretical analysis to illustrate it. We conduct an extensive set of experiments on several real-life datasets. The results demonstrate that our methods consistently outperform state-of-the-art solutions, achieving by up to 4 x for a broader class of workloads with different index operations.

Scalable Structural Index Construction for JSON Analytics [Download Paper] Lin Jiang (University of California, Riverside), Junqiao Qiu (Michigan Technological University), Zhijia Zhao (University of California, Riverside) JavaScript Object Notation (JSON) and its variants have gained great popularity in recent years. Unfortunately, the performance of their analytics is often dragged down by the expensive JSON parsing. To address this, recent work has shown that building bitwise indices on JSON data, called structural indices, can greatly accelerate querying. Despite its promise, the existing structural index construction does not scale well as records become larger and more complex, due to its (inherently) sequential construction process and the involvement of costly memory copies that grow as the nesting level increases. To address the above issues, this work introduces Pison – a more memory-efficient structural index constructor with supports of intra-record parallelism. First, Pison features a redesign of the bottleneck step in the existing solution. The new design is not only simpler but more memory-efficient. More importantly, Pison is able to build structural indices for a single bulky record in parallel, enabled by a group of customized parallelization techniques. Finally, Pison is also optimized for better data locality, which is especially critical in the scenario of bulky record processing. Our evaluation using real-world JSON datasets shows that Pison achieves 9.8X speedup (on average) over the existing structural index construction solution for bulky records and 4.6X speedup (on average) of end-to-end performance (indexing plus querying) over a state-of-the-art SIMD-based JSON parser on a 16-core machine.

09:00 – 10:30 CESTResearch Session 40: Data Integration Chaired by Nargesian Fatemeh

Deep Entity Matching with Pre-Trained Language Models [Download Paper] Yuliang Li (Megagon Labs), Jinfeng Li (Megagon Labs), Yoshihiko Suhara (Megagon Labs), AnHai Doan (University of Wisconsin-Madison), Wang-Chiew Tan (Facebook AI) We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms previous state-of-the-art (SOTA), by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto’s matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, Ditto is forced to learn “harder” to improve the model’s matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate Ditto’s effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.

Deep Learning for Blocking in Entity Matching: A Design Space Exploration [Download Paper] Saravanan Thirumuruganathan (QCRI), Han Li (Amazon Alexa AI), Nan Tang (Qatar Computing Research Institute, HBKU), Mourad Ouzzani (Qatar Computing Research Institute, HBKU), Yash Govind (UW - Madison), Derek Paulsen (University of Wisconsin-Madison), Glenn M Fung (American Family Insurance), AnHai Doan (University of Wisconsin-Madison) Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking then matching. Many works have applied deep learning (DL) to matching, but far fewer works have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL and some of them require labeled training data. In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL such as sequence modeling, transformer, and self supervision. We empirically determine which solutions perform best on what kind of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best non-DL solutions (including a state-of-the-art industrial non-DL solution), on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new venue for research.

Dual-Objective Fine-Tuning of BERT for Entity Matching [Download Paper] Ralph Peeters (University of Mannheim), Christian Bizer (University of Mannheim) An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.

Discovering Related Data At Scale [Download Paper] Sagar Bharadwaj K S (Microsoft), Praveen Gupta (Microsoft Research), Ranjita Bhagwan (Microsoft Research), Saikat Guha (Microsoft Research) Analysts frequently require data from multiple sources for their tasks, but finding these sources is challenging in exabyte-scale data lakes. In this paper, we address this problem for our enterprise's data lake by using machine-learning to identify related data sources. Leveraging queries made to the data lake over a month, we build a relevance model that determines whether two columns across two data streams are related or not. We then use the model to find relations at scale across tens of millions of column-pairs and thereafter construct a data relationship graph in a scalable fashion, processing a data lake that has 4.5 Petabytes of data in approximately 80 minutes. Using manually labeled datasets as ground-truth, we show that our techniques show improvements of at least 23% when compared to state-of-the-art methods.

TQEL: Framework for Query-Driven Linking of Top-K Entities in Social Media Blogs [Download Paper] Abdulrahman Alsaudi (University of California Irvine), Yasser Altowim (King Abdulaziz City for Science and Technology), Sharad Mehrotra (U.C. Irvine), Yaming Yu (University of California Irvine) Social media analysis over blogs (such as tweets) often requires determining top-k mentions of a certain category (e.g., movies) in a collection (e.g., tweets collected over a given day). Such queries are complex since it requires entity linking function to be executed that is often expensive. We propose two methods, TQEL-exact and TQEL-approxiate, to retrieve top-k entities belonging to a certain category in a given collection. Both approaches attempt to restrict entity linking to mentions associated with entities that have a high chance of being in the top-k answer. While TQEL-exact continues the process until there are deterministic guarantees (the result set is exactly the same as in the case if we linked entities in all the tweets prior to the query), the TQEL-approximate stops early with probabilistic guarantees based on user specified threshold. While we describe both, we focus on TQEL-approximate since it is our main contribution. TQEL-approximate uses two statistical techniques (normal approximation and Monte-Carlo simulation) to guide the linking process and return an approximate answer with guarantees. Our experiments shows that TQEL-approximate provides significant savings in terms of entity linking compared to TQEL-exact and that both are orders of magnitude better compared to a naive approach that calls the entity linking on the entire dataset.

09:00 – 10:30 CESTResearch Session 41: Graph Neural Networks Chaired by Michael Loster

Accelerating Large Scale Real-Time GNN Inference using Channel Pruning [Download Paper] hongkuan zhou (University of Southern California), Ajitesh Srivastava (University of Southern California), Hanqing Zeng (USC), Rajgopal Kannan (University of Southern California), Viktor K Prasanna (Unversity of Southern California) Graph Neural Networks (GNNs) are proven to be powerful models to generate node embedding for downstream applications. However, due to the high computation complexity of GNN inference, it is hard to deploy GNNs for large-scale or real-time applications. In this paper, we propose to accelerate GNN inference by pruning the dimensions in each layer with negligible accuracy loss. Our pruning framework uses a novel LASSO regression formulation for GNNs to identify feature dimensions (channels) that have high influence on the output activation. We identify two inference scenarios and design pruning schemes based on their computation and memory usage for each. To further reduce the inference complexity, we effectively store and reuse hidden features of visited nodes, which significantly reduces the number of supporting nodes needed to compute the target embedding. We evaluate the proposed method with the node classification problem on five popular datasets and a real-time spam detection application. We demonstrate that the pruned GNN models greatly reduce computation and memory usage with little accuracy loss. For full inference, the proposed method achieves an average of 3.27x speedup with only 0.002 drop in F1-Micro on GPU. For batched inference, the proposed method achieves an average of 6.67x speedup with only 0.003 drop in F1-Micro on CPU. To the best of our knowledge, we are the first to accelerate large scale real-time GNN inference through channel pruning.

Efficient Streaming Subgraph Isomorphism with Graph Neural Networks [Download Paper] Chi Thang Duong (Ecole Polytechnique Federale de Lausanne), Dung Trung Hoang (Hanoi University of Science and Technology), Hongzhi Yin (The University of Queensland), Matthias Weidlich (Humboldt-Universität zu Berlin), Quoc Viet Hung Nguyen (Griffith University), Karl Aberer (EPFL) Queries to detect isomorphic subgraphs are important in graph-based data management. While the problem of subgraph isomorphism search has received considerable attention for the static setting of a single query, or a batch thereof, existing approaches do not scale to a dynamic setting of a continuous stream of queries. In this paper, we address the scalability challenges induced by a stream of subgraph isomorphism queries by caching and re-use of previous results. We first present a novel subgraph index based on graph embeddings that serves as the foundation for efficient stream processing. It enables not only effective caching and re-use of results, but also speeds-up traditional algorithms for subgraph isomorphism in case of cache misses. Moreover, we propose cache management policies that incorporate notions of reusability of query results. Experiments using real-world datasets demonstrate the effectiveness of our approach in handling isomorphic subgraph search for streams of queries.

Grain: Improving Data Efficiency of Graph Neural Networks via Diversified Influence Maximization [Download Paper] Wentao Zhang (Peking University), Zhi Yang (Peking University), YeXin Wang (Peking University), Yu Shen (Peking University), Yang Li (Peking University), Liang Wang (Alibaba group), Bin Cui (Peking University) Data selection methods, such as active learning and core-set selection, are useful tools for improving the data efficiency of deep learning models on large-scale datasets. However, recent deep learning models have moved forward from independent and identically distributed data to graph-structured data, such as social networks, e-commerce user-item graphs, and knowledge graphs. This evolution has led to the emergence of Graph Neural Networks (GNNs) that go beyond the models existing data selection methods are designed for. Therefore, we present GRAIN, an efficient framework that opens up a new perspective through connecting data selection in GNNs with social influence maximization. By exploiting the common patterns of GNNs, GRAIN introduces a novel feature propagation concept, a diversified influence maximization objective with novel influence and diversity functions, and a greedy algorithm with an approximation guarantee into a unified framework. Empirical studies on public datasets demonstrate that GRAIN significantly improves both the performance and efficiency of data selection (including active learning and core-set selection) for GNNs. To the best of our knowledge, this is the first attempt to bridge two largely parallel threads of research, data selection, and social influence maximization, in the setting of GNNs, paving new ways for improving data efficiency.

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture [Download Paper] Seung Won Min (University of Illinois at Urbana-Champaign), Kun Wu (University of Illinois at Urbana-Champaign), Sitao Huang (University of Illinois at Urbana-Champaign), Mert Hidayetoglu (University of Illinois at Urbana-Champaign), Jinjun Xiong (IBM Thomas J. Watson Research Center), Eiman Ebrahimi (NVIDIA), Deming Chen (University of Illinois at Urbana-Champaign), Wen-mei Hwu (NVIDIA Corporation) Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU. This is because the CPU needs to (1) read sparse features from memory, (2) write features into memory as a dense format, and (3) transfer the features from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses without much CPU help. By removing the CPU gathering stage, our method significantly reduces the consumption of the host resources and data access latency. We further present two important techniques to achieve high host memory access efficiency by the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness using several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65-92\% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.

Multi-Modal Transportation Recommendation with Unified Route Representation Learning [Download Paper] Hao LIU (Business Intelligence Lab, Baidu Research), Jindong Han (Baidu), Yanjie Fu (University of Central Florida), Jingbo Zhou (Baidu Inc.), Xinjiang Lu (Baidu Inc.), Hui Xiong (Rutgers University) Multi-modal transportation recommendation aims to provide the most appropriate travel route with various transportation modes according to certain criteria. After analyzing large-scale navigation data, we find that route representations exhibit two patterns: spatio-temporal autocorrelations within transportation networks and the semantic coherence of route sequences. However, there are few studies that consider both patterns when developing multi-modal transportation systems. To this end, in this paper, we study multi-modal transportation recommendation with unified route representation learning by exploiting both spatio-temporal dependencies in transportation networks and the semantic coherence of historical routes. Specifically, we propose to unify both dynamic graph representation learning and hierarchical multi-task learning for multi-modal transportation recommendations. Along this line, we first transform the multi-modal transportation network into time-dependent multi-view transportation graphs and propose a spatiotemporal graph neural network module to capture the spatial and temporal autocorrelation. Then, we introduce a coherent-aware attentive route representation learning module to project arbitrary-length routes into fixed-length representation vectors, with explicit modeling of route coherence from historical routes. Moreover, we develop a hierarchical multi-task learning module to differentiate route representations for different transport modes, and this is guided by the final recommendation feedback as well as multiple auxiliary tasks equipped in different network layers. Extensive experimental results on two large-scale real-world datasets demonstrate the performance of the proposed system outperforms eight baselines.

11:00 – 12:30 CESTResearch Session 42: Storage and in-memory DBMS Chaired by Andreas Kipf

Taurus: Lightweight Parallel Logging for In-Memory Database Management Systems [Download Paper] Yu Xia (MIT), Xiangyao Yu (University of Wisconsin-Madison), Andrew Pavlo (Carnegie Mellon University), Srinivas Devadas (MIT) Existing single-stream logging schemes are unsuitable for in-memory database management systems (DBMSs) as the single log is often a performance bottleneck. To overcome this problem, we present Taurus, an efficient parallel logging scheme that uses multiple log streams, and is compatible with both data and command logging. Taurus tracks and encodes transaction dependencies using a vector of log sequence numbers (LSNs). These vectors ensure that the de- pendencies are fully captured in logging and correctly enforced in recovery. Our experimental evaluation with an in-memory DBMS shows that Taurus’s parallel logging achieves up to 9.9× and 2.9× speedups over single-streamed data logging and command logging, respectively. It also enables the DBMS to recover up to 22.9× and 75.6× faster than these baselines for data and command logging, respectively. We also compare Taurus with two state-of-the-art parallel logging schemes and show that the DBMS achieves up to 2.8× better performance on NVMe drives and 9.2× on HDDs.

CoroBase: Coroutine-Oriented Main-Memory Database Engine [Download Paper] Yongjun He (Simon Fraser University), Jiacheng Lu (Simon Fraser University), Tianzheng Wang (Simon Fraser University) Data stalls are a major overhead in main-memory database engines due to the use of pointer-rich data structures. Lightweight coroutines ease the implementation of software prefetching to hide data stalls by overlapping computation and asynchronous data prefetching. Prior solutions, however, mainly focused on (1) individual components and operations and (2) intra-transaction batching that requires interface changes, breaking backward compatibility. It was not clear how they apply to a full database engine and how much end-to-end benefit they bring under various workloads. This paper presents CoroBase, a main-memory database engine that tackles these challenges with a new coroutine-to-transaction paradigm. Coroutine-to-transaction models transactions as coroutines and thus enables inter-transaction batching, avoiding application changes but retaining the benefits of prefetching. We show that on a 48-core server, CoroBase can perform close to 2x better for read-intensive workloads, and remain competitive for workloads that inherently do not benefit from software prefetching.

Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs [Download Paper] Diego Didona (IBM Research - Zurich), Nikolas Ioannou (IBM Research - Zurich), Radu I Stoica (IBM Research - Zurich), Kornilios Kourtis (Isovalent) Solid-state drives (SSDs) are extensively used to deploy persistent data stores, as they provide low latency random access, high write throughput, high data density, and low cost. Tree-based data structures are widely used to build persistent data stores, and indeed they lie at the backbone of many of the data management systems used in production and research today. We show that benchmarking a persistent tree-based data structure on an SSD is a complex process, which may easily incur subtle pitfalls that can lead to an inaccurate performance assessment. At a high-level, these pitfalls stem from the interaction of complex software running on complex hardware. On the one hand, tree structures implement internal operations that have non-trivial effects on performance. On the other hand, SSDs employ firmware logic to deal with the idiosyncrasies of the underlying flash memory, which are well known to also lead to complex performance dynamics. We identify seven benchmarking pitfalls using RocksDB and WiredTiger, two widespread implementations of an LSM-Tree and a B+Tree, respectively. We show that such pitfalls can lead to incorrect measurements of key performance indicators, hinder the reproducibility and the representativeness of the results, and lead to suboptimal deployments in production environments. We also provide guidelines on how to avoid these pitfalls to obtain more reliable performance measurements, and to perform more thorough and fair comparisons among different design points.

An Analysis of Concurrency Control Protocols fo In-Memory Database with CCBench [Download Paper] Takayuki Tanabe (University of Tsukuba), Takashi Hoshino (Cybozu Labs), Hideyuki Kawashima (Keio University), Osamu Tatebe (University of Tsukuba) This paper presents yet another concurrency control analy-sis platform, CCBench. CCBench supports seven protocols (Silo, TicToc, MOCC, Cicada, SI, SI with latch-free SSN, 2PL) and seven versatile optimization methods and enables the conﬁguration of seven workload parameters. We ana-lyzed the protocols and optimization methods using various workload parameters and a thread count of 224. Previous studies focused on thread scalability and did not explore the space analyzed here. We classiﬁed the optimization methods on the basis of three performance factors: CPU cache, delay on conﬂict, and version lifetime. Analyses using CCBench and 224 threads, produced six insights. (I1) The perfor-mance of optimistic concurrency control protocol for a read-only workload rapidly degrades as cardinality increases even without L3 cache misses. (I2) Silo can outperform TicToc for some write-intensive workloads by using invisible reads optimization. (I3) The eﬀectiveness of two approaches to coping with conﬂict (wait and no-wait) depends on the situ-ation. (I4) OCC reads the same record two or more times if a concurrent transaction interruption occurs, which can im-prove performance. (I5) Mixing diﬀerent implementations is inappropriate for deep analysis. (I6) Even a state-of-the-art garbage collection method cannot improve the performance of multi-version protocols if there is a single long transaction mixed into the workload. On the basis of I4, we deﬁned the read phase extension optimization in which an artiﬁcial de-lay is added to the read phase. On the basis of I6, we deﬁned the aggressive garbage collection optimization in which even visible versions are collected. The code for CCBench and all the data in this paper are available online at GitHub.

Constructing and Analyzing the LSM Compaction Design Space [Download Paper] Subhadeep Sarkar (Boston University), Dimitris Staratzis (Boston University), Zichen Zhu (Boston University), Manos Athanassoulis (Boston University) Log-structured merge (LSM) trees offer efficient ingestion by appending incoming data, and thus, are widely used as the storage layer of production NoSQL data stores. To enable competitive read performance, LSM-trees periodically re-organize data to form a tree with levels of exponentially increasing capacity, through iterative compactions. Compactions fundamentally influence the performance of an LSM-engine in terms of write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. Hence, choosing the appropriate compaction strategy is crucial and, at the same time, hard as the LSM-compaction design space is vast, largely unexplored, and has not been formally defined in the literature. As a result, most LSM-based engines use a fixed compaction strategy, typically hand-picked by an engineer, which decides how and when to compact data. In this paper, we present the design space of LSM-compactions, and evaluate state-of-the-art compaction strategies with respect to key performance metrics. Toward this goal, our first contribution is to introduce a set of four design primitives that can formally define any compaction strategy: (i) the compaction trigger, (ii) the data layout, (iii) the compaction granularity, and (iv) the data movement policy. Together, these primitives can synthesize both existing and completely new compaction strategies. Our second contribution is to experimentally analyze 10 compaction strategies. We present 12 observations and 7 high-level takeaway messages, which show how LSM systems can navigate the compaction design space.

11:00 – 12:30 CESTResearch Session 43: Representation Learning Chaired by Michael Loster

Scaling Attributed Network Embedding to Massive Graphs [Download Paper] Renchi Yang (National University of Singapore), Jieming Shi (The Hong Kong Polytechnic University), Xiaokui Xiao (National University of Singapore), Yin Yang (Hamad bin Khalifa University), Juncheng Liu (National University of Singapore), Sourav S Bhowmick (Nanyang Technological University) Given a graph 𝐺 where each node is associated with a set of attributes, attributed network embedding (ANE) maps each node 𝑣∈𝐺 to a compact vector 𝑋𝑣, which can be used in downstream machine learning tasks. Ideally, 𝑋𝑣 should capture node 𝑣’s affinity to each attribute, which considers not only 𝑣’s own attribute associations, but also those of its connected nodes along edges in 𝐺. It is challenging to obtain high-utility embeddings that enable accurate predictions; scaling effective ANE computation to massive graphs with millions of nodes pushes the difficulty of the problem to a whole new level. Existing solutions largely fail on such graphs, leading to prohibitive costs, low-quality embeddings, or both. This paper proposes PANE, an effective and scalable approach to ANE computation for massive graphs that achieves state-of-the-art result quality on multiple benchmark datasets, measured by the accuracy of three common prediction tasks: attribute inference, link prediction, and node classification. In particular, for the large MAG data with over 59 million nodes, 0.98 billion edges, and 2000 attributes, PANE is the only known viable solution that obtains effective embeddings on a single server, within 12 hours. PANE obtains high scalability and effectiveness through three main algorithmic designs. First, it formulates the learning objective based on a novel random walk model for attributed networks. The resulting optimization task is still challenging on large graphs. Second, PANE includes a highly efficient solver for the above optimization problem, whose key module is a carefully designed initialization of the embeddings, which drastically reduces the number of iterations required to converge. Finally, PANE utilizes multi-core CPUs through non-trivial parallelization of the above solver, which achieves scalability while retaining the high quality of the resulting embeddings. Extensive experiments, comparing 10 existing approaches on 8 real datasets, demonstrate that PANE consistently outperforms all existing methods in terms of result quality, while being orders of magnitude faster.

FREDE: Anytime Graph Embeddings [Download Paper] Anton Tsitsulin (University of Bonn), Marina Munkhoeva (Skoltech), Davide Mottin (Aarhus University), Panagiotis Karras (Aarhus University), Ivan Oseledets (Skolkovo Institute of Science and Technology), Emmanuel Müller (University of Bonn & Fraunhofer IAIS) Low-dimensional representations, or embeddings, of a graph’s nodes facilitate several practical data science and data engineering tasks. As such embeddings rely, explicitly or implicitly, on a similarity measure among nodes, they require the computation of a quadratic similarity matrix, inducing a tradeoff between space complexity and embedding quality. To date, no graph embedding work combines (i) linear space complexity, (ii) a nonlinear transform as its basis, and (iii) nontrivial quality guarantees. In this paper we introduce FREDE (FREquent Directions Embedding), a graph embedding based on matrix sketching that combines those three desiderata. Starting out from the observation that embedding methods aim to preserve the covariance among the rows of a similarity matrix, FREDE iteratively improves on quality while individually processing rows of a nonlinearly transformed PPR similarity matrix derived from a state-of-the-art graph embedding method and provides, at any iteration, column-covariance approximation guarantees in due course almost indistinguishable from those of the optimal approximation by SVD. Our experimental evaluation on variably sized networks shows that FREDE performs almost as well as SVD and competitively against state-of-the-art embedding methods in diverse data science tasks, even when it is based on as little as 10% of node similarities.

Tensors: An abstraction for general data processing [Download Paper] Dimitrios Koutsoukos (ETHZ), Supun C Nakandala (University of California, San Diego), Konstantinos Karanasos (Microsoft), Karla Saur (Microsoft), Gustavo Alonso (ETHZ), Matteo Interlandi (Microsoft) Deep Learning (DL) has created a growing demand for simpler ways to develop complex models and efficient ways to execute them. Thus, a significant effort has gone into frameworks like Py- Torch or TensorFlow to support a variety of DL models and run efficiently and seamlessly over heterogeneous and distributed hard- ware. Since these frameworks will continue improving given the predominance of DL workloads, it is natural to ask what else can be done with them. This is not a trivial question since these frame- works are based on the efficient implementation of tensors, which are well adapted to DL but, in principle, to nothing else. In this paper we explore to what extent Tensor Computation Runtimes (TCRs) can support non-ML data processing applications, so that other use cases can take advantage of the investments made on TCRs. In particular, we are interested in graph processing and relational operators, two use cases very different from ML, in high demand, and complement quite well what TCRs can do today. Build- ing on Hummingbird, a recent platform converting traditional machine learning algorithms to tensor computations, we explore how to map selected graph processing and relational operator algorithms into tensor computations. Our vision is supported by the results: our code often outperforms custom-built C++ and CUDA kernels, while massively reducing the development effort, taking advantage of the cross-platform compilation capabilities of TCRs.

TURL: Table Understanding through Representation Learning [Download Paper] Xiang Deng (The Ohio State University), Huan Sun (Ohio State University), Alyssa Lees (Google), You Wu (Google), Cong Yu (Google) Relational tables on the Web store a vast amount of knowledge. Owing to the wealth of such tables, there has been tremendous progress on a variety of tasks in the area of table understanding. However, existing work generally relies on heavily-engineered task-specific features and model architectures. In this paper, we present TURL, a novel framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. Its universal model design with pre-trained representations can be applied to a wide range of tasks with minimal task-specific fine-tuning. Specifically, we propose a structure-aware Transformer encoder to model the row-column structure of relational tables, and present a new Masked Entity Recovery (MER) objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data. We systematically evaluate TURL with a benchmark consisting of 6 different tasks for table understanding (e.g., relation extraction, cell filling). We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.

11:00 – 12:30 CESTResearch Session 44: Distributed Systems Chaired by Alberto Lerner

Seagull: An Infrastructure for Load Prediction and Optimized Resource Allocation [Download Paper] Olga Poppe (Microsoft), Tayo Amuneke (Microsoft), Dalitso Banda (Microsoft), Aritra De (Microsoft), Ari Green (Microsoft), Manon Knoertzer (Microsoft), Ehi Nosakhare (Microsoft), Karthik Rajendran (Microsoft), Deepak Shankargouda (Microsoft), Meina Wang (Microsoft), Alan Au (Microsoft), Carlo Curino (Microsoft -- GSL), Qun Guo (Microsoft), Alekh Jindal (Microsoft), Ajay Kalhan (Microsoft), Morgan Oslake (Microsoft), Sonia Parchani (Microsoft), Vijay Ramani (Microsoft), Raj Sellappan (Microsoft), Saikat Sen (Microsoft), Sheetal Shrotri (Microsoft), Soundararajan Srinivasan (Microsoft), Ping Xia (Microsoft), Shize Xu (Microsoft), Alicia Yang (Microsoft), Yiwen Zhu (Microsoft) Microsoft Azure is dedicated to guarantee high quality of service to its customers, in particular, during periods of high customer activity, while controlling cost. We employ a Data Science (DS) driven solution to predict user load and leverage these predictions to optimize resource allocation. To this end, we built the Seagull infrastructure that processes per-server telemetry, validates the data, trains and deploys ML models. The models are used to predict customer load per server (24h into the future), and optimize service operations. Seagull continually re-evaluates accuracy of predictions, fallback to previously known good models and triggers alerts as appropriate. We deployed this infrastructure in production for PostgreSQL and MySQL servers across all Azure regions, and applied it to the problem of scheduling server backups during low-load time. This minimizes interference with user-induced load and improves customer experience.

ByShard: Sharding in a Byzantine Environment [Download Paper] Jelle Hellings (University of California Davis), Mohammad Sadoghi (University of California, Davis) The emergence of blockchains has fueled the development of resilient systems that can deal with Byzantine failures due to crashes, bugs, or even malicious behavior. Recently, we have also seen the exploration of sharding in these resilient systems, this to provide the scalability required by very large data-based applications. Unfortunately, current sharded resilient systems all use system-specific specialized approaches toward sharding that do not provide the flexibility of traditional sharded data management systems. To improve on this situation, we fundamentally look at the design of sharded resilient systems. We do so by introducing ByShard, a unifying framework for the study of sharded resilient systems. Within this framework, we show how two-phase commit and two-phase locking---two techniques central to providing atomicity and isolation in traditional sharded databases---can be implemented efficiently in a Byzantine environment, this with a minimal-usage of costly Byzantine resilient primitives. Based on these techniques, we propose eighteen multi-shard transaction processing methods. Finally, we practically evaluate these methods and show that each method supports high transaction throughput and provides scalability while each striking its own trade-off between isolation level, latency, and abort rate. As such, our work provides a strong foundation for the development of ACID-compliant general-purpose and flexible sharded resilient data management systems.

MorphoSys: Automatic Physical Design Metamorphosis for Distributed Database Systems [Download Paper] Michael Abebe (University of Waterloo), Brad Glasbergen (University of Waterloo), Khuzaima Daudjee (University of Waterloo) Distributed database systems are widely used to meet the demands of storing and managing computation-heavy work-loads. To boost performance and minimize resource and data contention, these systems require selecting a distributed physical design that determines where to place data, and which data items to replicate and partition. Deciding on a physical design is diﬃcult as each choice poses a trade-oﬀ in the design space, and a poor choice can signiﬁcantly de-grade performance. Current design decisions are typically static and cannot adapt to workload changes or are unable to combine multiple design choices such as data replication and data partitioning integrally. This paper presents Mor-phoSys, a distributed database system that dynamically chooses, and alters, its physical design based on the work-load. MorphoSys makes integrated design decisions for all of the data partitioning, replication and placement decisions on-the-ﬂy using a learned cost model. MorphoSys provides ef-ﬁcient transaction execution in the face of design changes via a novel concurrency control and update propagation scheme. Our experimental evaluation, using several benchmark work-loads and state-of-the-art comparison systems, shows that MorphoSys delivers excellent system performance through eﬀective and eﬃcient physical designs.

Trident: Task Scheduling over Tiered Storage Systems in Big Data Platforms [Download Paper] Herodotos Herodotou (Cyprus University of Technology), Elena Kakoulli (Cyprus University of Technology) The recent advancements in storage technologies have popularized the use of tiered storage systems in data-intensive compute clusters. The Hadoop Distributed File System (HDFS), for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, the task schedulers of big data platforms (such as Hadoop and Spark) will assign tasks to available resources only based on data locality information, and completely ignore the fact that local data is now stored on a variety of storage media with different performance characteristics. This paper presents Trident, a principled task scheduling approach that is designed to make optimal task assignment decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and uses a standard solver for finding the optimal solution. In addition, Trident utilizes two novel pruning algorithms for bounding the size of the graph, while still guaranteeing optimality. Trident is implemented in both Spark and Hadoop, and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.

Towards Cost-Optimal Query Processing in the Cloud [Download Paper] Viktor Leis (Friedrich-Alexander-Universität Erlangen-Nürnberg), Maximilian Kuschewski (Uni Augsburg, LMU, TUM) Public cloud providers offer hundreds of heterogeneous hardware instances. For analytical query processing systems, this presents a major challenge: depending on the hardware configuration, performance and cost may differ by orders of magnitude. We propose a simple and intuitive model that takes the workload, hardware, and cost into account to determine the optimal instance configuration. We discuss how such a model-based approach can significantly reduce costs and also guide the evolution of cloud-native database systems to achieve our vision of cost-optimal query processing.

Crystal: A Unified Cache Storage System for Analytical Databases [Download Paper] Dominik Durner (TUM), Badrish Chandramouli (Microsoft Research), Yinan Li (Microsoft Research) Cloud analytical databases employ a disaggregated storage model, where the elastic compute layer accesses data persisted on remote cloud storage in block-oriented columnar formats. Given the high latency and low bandwidth to remote storage and the limited size of fast local storage, caching data at the compute node is important and has resulted in a renewed interest in caching for analytics. Today, each DBMS builds its own caching solution, usually based on file- or block-level LRU. In this paper, we advocate a new architecture of a smart cache storage system called Crystal, that is co-located with compute. Crystal's clients are DBMS-specific "data sources" with push-down predicates. Similar in spirit to a DBMS, Crystal incorporates query processing and optimization components focusing on efficient caching and serving of single-table hyper-rectangles called regions. Results show that Crystal, with a small Spark data source connector, can significantly improve query latencies on unmodified Spark while also saving on bandwidth from remote storage.

11:00 – 12:30 CESTResearch Session 45: Encrypted Storage and Blockchain Chaired by Divesh Srivastava

CALYPSO: Private Data Management for Decentralized Ledgers [Download Paper] Eleftherios Kokoris Kogias (IST Austria), Enis Ceyhun Alp (EPFL), Linus Gasser (EPFL), Philipp Jovanovic (UCL), Ewa Syta (Trinity College), Bryan Ford (EPFL) Distributed ledgers provide high availability and integrity, making them a key enabler for practical and secure computation of distributed workloads among mutually distrustful parties. Many practical applications also require strong confidentiality, however. This work enhances permissioned and permissionless blockchains with the ability to manage confidential data without forfeiting availability or decentralization. The proposed CALYPSO architecture addresses two orthogonal challenges confronting modern distributed ledgers: (a) enabling the auditable management of secrets and (b) protecting distributed computations against arbitrage attacks when their results depend on the ordering and secrecy of inputs. CALYPSO introduces on-chain secrets, a novel abstraction that enforces atomic deposition of an auditable trace whenever users access confidential data. CALYPSO provides user-controlled consent management that ensures revocation atomicity and accountable anonymity. To enable permissionless deployment, we introduce an incentive scheme and provide users with the option to select their preferred trustees. We evaluated our CALYPSO prototype with a confidential document-sharing application and a decentralized lottery. Our benchmarks show that transaction-processing latency increases linearly in terms of security (number of trustees) and is in the range of 0.2 to 8 seconds for 16 to 128 trustees.

SlimChain: Scaling Blockchain Transactions through Off-Chain Storage and Parallel Processing [Download Paper] Cheng Xu (Hong Kong Baptist University), Ce Zhang (Hong Kong Baptist University), Jianliang Xu (Hong Kong Baptist University), Jian Pei (Simon Fraser University) Blockchain technology has emerged as the cornerstone of many decentralized applications operating among otherwise untrusted peers. However, it is well known that existing blockchain systems do not scale well. Transactions are often executed and committed sequentially in order to maintain the same view of the total order. Furthermore, it is necessary to duplicate both transaction data and their executions in every node in the blockchain network for integrity assurance. Such storage and computation requirements put significant burdens on the blockchain system, not only limiting system scalability but also undermining system security and robustness by making the network more centralized. To tackle these problems, in this paper, we propose SlimChain, a novel blockchain system that scales transactions through off-chain storage and parallel processing. Advocating a stateless design, SlimChain maintains only the short commitments of ledger states on-chain while dedicating transaction executions and data storage to off-chain nodes. To realize SlimChain, we propose new schemes for off-chain smart contract execution, on-chain transaction validation, and state commitment. We also propose optimizations to reduce network transmissions and a new sharding technique to improve system scalability further. Extensive experiments are conducted to validate the performance of the proposed SlimChain system. Compared with the existing systems, SlimChain reduces the on-chain storage requirements by 97%~99%, while also improving the peak throughput by 1.4X~15.6X.

Building Enclave-Native Storage Engines for Practical Encrypted Databases [Download Paper] Yuanyuan Sun (Alibaba Group), Sheng Wang (Alibaba Group), Huorong Li (Alibaba Group), Feifei Li (Alibaba Group) Data confidentiality is one of the biggest concerns that hinders enterprise customers from moving their workloads to the cloud. Thanks to the trusted execution environment (TEE), it is now feasible to build encrypted databases in the enclave that can process customers' data while keeping it confidential to the cloud. Though some enclave-based encrypted databases emerge recently, there remains a large unexplored area in between about how confidentiality can be achieved in different ways and what influences are implied by them. In this paper, we first provide a broad exploration of possible design choices in building encrypted database storage engines, rendering trade-offs in security, performance and functionality. We observe that choices on different dimensions can be independent and their combination determines the overall trade-off of the entire storage. We then propose Enclage, an encrypted storage engine that makes practical trade-offs. It adopts many enclave-native designs, such as page-level encryption, reduced enclave interaction, and hierarchical memory buffer, which offer high-level security guarantee and high performance at the same time. To make better use of the limited enclave memory, we derive the optimal page size in enclave and adopt delta decryption to access large data pages with low cost. Our experiments show that Enclage outperforms the baseline, a common storage design in many encrypted databases, by over 13x in throughput and about 5x in storage savings.

Cryptanalysis of An Encrypted Database in SIGMOD '14 [Download Paper] Xinle Cao (Zhejiang University), Jian Liu (Zhejiang University), Hao Lu (Zhejiang University), Kui Ren (Zhejiang University) Encrypted database is an innovative technology proposed to solve the data confidentiality issue in cloud-based DB systems. It allows a data owner to encrypt its database before uploading it to the service provider; and it allows the service provider to execute SQL queries over the encrypted data. Most of existing encrypted databases (e.g., CryptDB in SOSP '11) do not support data interoperability: unable to process complex queries that require piping the output of one operation to another. To the best of our knowledge, SDB (SIGMOD '14) is the only encrypted database that achieves data interoperability. Unfortunately, we found SDB is not secure! In this paper, we revisit the security of SDB and propose a cipertext-only attack named co-prime attack. It successfully attacks the common operations supported by SDB, including addition, comparison, sum, join and group-by. We evaluate our attack in three real-world benchmarks. For columns that support addition and comparison, we recover 84.9%-99.9% plaintexts. For columns that support sum, join and group-by, we recover 100% plaintexts. Besides, we provide potential countermeasures that can prevent the attacks against sum, join, group-by and addition. It is still an open problem to prevent the attack against comparison.

Software-Defined Data Protection: Low Overhead Policy Compliance at the Storage Layer is Within Reach! [Download Paper] Zsolt István (IT University Copenhagen), Soujanya Ponnapalli (UT Austin), Vijay Chidambaram (UT Austin and VMWare) Most modern data processing pipelines run on top of a distributed storage layer, and securing the whole system, and the storage layer in particular, against accidental or malicious misuse is crucial to ensuring compliance to rules and regulations. Enforcing data protection and privacy rules, however, stands at odds with the requirement to achieve higher and higher access bandwidths and processing rates in large data processing pipelines. In this work we describe our proposal for the path forward that reconciles the two goals. We call our approach ``Software-Defined Data Protection'' (SDP). Its premise is simple, yet powerful: decoupling often changing policies from request-level enforcement allows distributed smart storage nodes to implement the latter at line-rate. Existing and future data protection frameworks can be translated to the same hardware interface which allows storage nodes to offload enforcement efficiently both for company-specific rules and regulations, such as GDPR or CCPA. While SDP is a promising approach, there are several remaining challenges to making this vision reality. As we explain in the paper, overcoming these will require collaboration across several domains, including security, databases and specialized hardware design.

11:00 – 12:30 CESTResearch Session 46: Graph: Community Search Chaired by Laurent Bindschaedler

Butterfly-Core Community Search over Labeled Graphs [Download Paper] Zheng Dong (Baidu), Xin Huang (Hong Kong Baptist University), Guorui Yuan (Baidu), Hengshu Zhu (Baidu), Hui Xiong (Rutgers University) Community search aims at finding densely connected subgraphs for query vertices in a graph. While this task has been studied widely in the literature, most of the existing works only focus on finding homogeneous communities rather than heterogeneous communities with different labels. In this paper, we motivate a new problem of cross-group community search, namely Butterfly-Core Community (BCC), over a labeled graph, where each vertex has a label indicating its properties and an edge between two vertices indicates their cross relationship. Specifically, for two query vertices with different labels, we aim to find a densely connected cross community that contains two query vertices and consists of butterfly networks, where each wing of the butterflies is induced by a k-core search based on one query vertex and two wings are connected by these butterflies. We first develop a heuristic algorithm achieving $2$-approximation to the optimal solution. Furthermore, we design fast techniques of query distance computations, leader pair identifications, and index-based BCC local explorations. Extensive experiments on seven real datasets and four useful case studies validate the effectiveness and efficiency of our BCC and its multi-labeled extension models.

Efficient Size-Bounded Community Search over Large Networks [Download Paper] Kai Yao (The University of Sydney), Lijun Chang (The University of Sydney) The problem of community search, which aims to find a cohesive subgraph containing user-given query vertices, has been extensively studied recently. Most of the existing studies mainly focus on the cohesiveness of the returned community, while ignoring the size of the community, and may yield communities of very large size. However, many applications naturally require that the number of vertices/members in a community should fall within a certain range. In this paper, we design exact algorithms for the general size-bounded community search problem that aims to find a subgraph with the largest min-degree among all connected subgraphs that contain the query vertex $q$ and have at least $\ell$ and at most $h$ vertices, where $q, \ell, h$ are specified by the query. As the problem is NP-hard, we propose a branch-reduce-and-bound algorithm SC-BRB by developing nontrivial reducing techniques, upper bounding techniques, and branching techniques. Experiments on large real graphs show that SC-BRB on average increases the minimum degree of the community returned by the state-of-the-art heuristic algorithm GreedyF by a factor of $2.41$ and increases the edge density by a factor of $2.2$. In addition, SC-BRB is several orders of magnitude faster than a baseline approach, and all of our proposed techniques contribute to the efficiency of SC-BRB.

ICS-GNN: Lightweight Interactive Community Search via Graph Neural Network [Download Paper] Jun Gao (Peking University), Jiazun Chen (Peking University), Zhao Li (Alibaba Group), Ji Zhang (The University of Southern Queensland) Searching a community containing a given query vertex in an online social network enjoys wide applications like recommendation, team organization, etc. When applied to real-life networks, the existing approaches face two major limitations. First, they usually take two steps, ie crawling a large part of the network first and then finding the community next, but the entire network is usually too big and most of the data are not interesting to end users. Second, the existing methods utilize hand-crafted rules to measure community membership, while it is very difficult to define effective rules as the communities are flexible for different query vertices. In this paper, we propose an Interactive Community Search method based on Graph Neural Network (shortened by ICS-GNN) to locate the target community over a subgraph collected on the fly from an online network. Specifically, we recast the community membership problem as a vertex classification problem using GNN, which captures similarities between the graph vertices and the query vertex by combining content and structural features seamlessly and flexibly under the guide of users' labeling. We then introduce a k-sized Maximum-GNN-scores (shortened by kMG) community to describe the target community. We next discover the target community iteratively and interactively. In each iteration, we build a candidate subgraph using the crawled pages with the guide of the query vertex and labeled vertices, infer the vertex scores with a GNN model trained on the subgraph, and discover the kMG community which will be evaluated by end users to acquire more feedback. Besides, two optimization strategies are proposed to combine ranking loss into the GNN model and search more space in the target community location. We conduct the experiments in both offline and online real-life data sets, and demonstrate that ICS-GNN can produce effective communities with low overhead in communication, computation, and user labeling.

Scalable Community Detection via Parallel Correlation Clustering [Download Paper] Jessica Shi (MIT), Laxman Dhulipala (MIT CSAIL), David Eisenstat (Google), Jakub Łącki (Google), Vahab Mirrokni (Google) Graph clustering and community detection are central in modern data mining. The increasing need for analyzing billion-scale data sets calls for faster and more scalable algorithms. There are certain trade-offs between quality and speed of such clustering algorithms. In this paper, we aim to design scalable algorithms that achieve high quality when evaluated based on ground truth. We develop a generalized sequential and shared-memory parallel framework that encompass modularity and correlation clustering. This frameworks applies to LambdaCC objective (introduced by Veldt et al.) unifying several quality measures. Our framework consists of highly-optimized implementations that demonstrably scale to large data sets of up to billions of edges and that obtain high-quality clusters compared to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44x speedups over our own sequential baselines while maintaining or improving quality.

Scalable Mining of Maximal Quasi-Cliques: An Algorithm-System Codesign Approach [Download Paper] Guimu Guo (The University of Alabama at Birmingham), Da Yan (The University of Alabama at Birmingham), Tamer Özsu (University of Waterloo), Zhe Jiang (University of Alabama), Jalal Khalil (The University of Alabama at Birmingham)

13:30 – 15:00 CESTResearch Session 47: Transactions and Queries Chaired by Boris Novikov

Database Isolation By Scheduling [Download Paper] Kevin P Gaffney (University of Wisconsin-Madison), Robert K Claus (UW Madison), Jignesh Patel (UW - Madison) Transaction isolation is conventionally achieved by restricting access to the physical items in a database. To maximize performance, isolation functionality is often packaged with recovery, I/O, and data access methods in a monolithic transactional storage manager. While this design has historically afforded high performance in online transaction processing systems, industry trends indicate a growing need for a new approach in which intertwined components of the transactional storage manager are disaggregated into modular services. This paper presents a new method to modularize the isolation component. Our work builds on predicate locking, an isolation mechanism that enables this modularization by locking logical rather than physical items in a database. Predicate locking is rarely used as the core isolation mechanism because of its high theoretical complexity and perceived overhead. However, we show that this overhead can be substantially reduced in practice by optimizing for common predicate structures. We present DIBS, a transaction scheduler that employs our predicate locking optimizations to guarantee isolation as a modular service. We evaluate the performance of DIBS as the sole isolation mechanism in a data processing system. In this setting, DIBS scales up to 10.5 million transactions per second on a TATP workload. We also explore how DIBS can be applied to existing database systems to increase transaction throughput. DIBS reduces per-transaction file system writes by 90% on TATP in SQLite, resulting in a 3X improvement in throughput. Finally, DIBS reduces row contention on YCSB in MySQL, providing serializable isolation with a 1.4X improvement in throughput.

Epoch-based Commit and Replication in Distributed OLTP Databases [Download Paper] Yi Lu (MIT), Xiangyao Yu (University of Wisconsin-Madison), Lei Cao (MIT), Samuel Madden (MIT) Many modern data-oriented applications are built on top of distributed OLTP databases for both scalability and high availability. Such distributed databases enforce atomicity, durability, and consistency through two-phase commit (2PC) and synchronous replication at the granularity of every single transaction. In this paper, we present COCO, a new distributed OLTP database that supports epoch-based commit and replication. The key idea behind COCO is that it separates transactions into epochs and treats a whole epoch of transactions as the commit unit. In this way, the overhead of 2PC and synchronous replication is significantly reduced. We support two variants of optimistic concurrency control (OCC) using physical time and logical time with various optimizations, which are enabled by the epoch-based execution. Our evaluation on two popular benchmarks (YCSB and TPC-C) show that COCO outperforms systems with fine-grained 2PC and synchronous replication by up to a factor of four.

Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats [Download Paper] Tianyu Li (Massachusetts Institute of Technology), Matthew Butrovich (Carnegie Mellon University), Amadou Ngom (Carnegie Mellon University), Wan Shen Lim (Carnegie Mellon University), Wes McKinney (Ursa Labs), Andrew Pavlo (Carnegie Mellon University) The proliferation of modern data processing tools has given rise to open-source columnar data formats. These formats help organizations avoid repeated conversion of data to a new format for each application. However, these formats are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. As a result, DBMSs often fail to take advantage of full network bandwidth when transferring data. We aim to reduce or even eliminate this overhead by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. We implemented our storage engine based on the Apache Arrow format and integrated it into the NoisePage DBMS to evaluate our work. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.

Robustness against Read Committed for Transaction Templates [Download Paper] Brecht Vandevoort (Hasselt University), Bas Ketsman (Vrije Universiteit Brussel), Christoph Koch (EPFL, Switzerland), Frank Neven (Hasselt University) The isolation level Multiversion Read Committed (RC), offered by many database systems, is known to trade consistency for increased transaction throughput. Sometimes, transaction workloads can be safely executed under RC obtaining the perfect isolation of serializability at the lower cost of RC. To identify such cases, we introduce an expressive model of transaction programs to better reason about the serializability of transactional workloads. We develop tractable algorithms to decide whether any possible schedule of a workload executed under RC is serializable (referred to as the robustness problem). Our approach yields robust subsets that are larger than those identified by previous methods. We provide experimental evidence that workloads that are robust against RC can be evaluated faster under RC compared to stronger isolation levels. We discuss techniques for making workloads robust against RC by promoting selective read operations to updates. Depending on the scenario, the performance improvements can be considerable. Robustness testing and safely executing transactions under the lower isolation level RC can therefore provide a direct way to increase transaction throughput without changing DBMS internals.

Permutable Compiled Queries: Dynamically Adapting Compiled Queries without Recompiling [Download Paper] Prashanth Menon (Carnegie Mellon Universiy), Amadou Ngom (Carnegie Mellon University), Todd Mowry (Carnegie Mellon University), Andrew Pavlo (Carnegie Mellon University), Lin Ma (Carnegie Mellon University) Just-in-time (JIT) query compilation is a common technique to improve analytical query performance in database management systems (DBMSs). However, the cost of compiling each query can be significant relative to its execution time. This overhead prohibits the DBMS from employing well-known adaptive query processing (AQP) methods to generate a new plan for a query if data distributions do not match the optimizer's estimations. The optimizer could eagerly generate multiple sub-plans for a query, but it can only include a few alternatives as each addition increases the query compilation time. We present a new method, called Permutable Compiled Queries (PCQ), that bridges the gap between JIT compilation and AQP. It allows the DBMS to modify compiled queries without needing to recompile the plan or including all possible variations before the query starts. With PCQ, the DBMS structures a query's code to include specialized data structures in the logic that facilitate dynamic changes. These structures provide an indirection layer that enables the DBMS to change the plan even while it is running. We implement PCQ in an in-memory DBMS and compare it against non-adaptive plans in a microbenchmark and against state-of-the-art analytic DBMSs. Our evaluation shows that PCQ outperforms static plans by more than 4x and yields better performance on an analytical benchmark by more than 2x against other DBMSs.

13:30 – 15:00 CESTResearch Session 48: Differential Privacy II Chaired by Divesh Srivastava

Real-World Trajectory Sharing with Local Differential Privacy [Download Paper] Teddy Cunningham (University of Warwick), Graham Cormode (University of Warwick), Hakan Ferhatosmanoglu (University of Warwick), Divesh Srivastava (AT&T Labs Research) Sharing trajectories is beneficial for many real-world applications, such as managing disease spread through contact tracing and tailoring public services to a population's travel patterns. However, public concern over privacy and data protection has limited the extent to which this data is shared. Local differential privacy enables data sharing in which users share a perturbed version of their data, but existing mechanisms fail to incorporate user-independent public knowledge (e.g., business locations and opening times, public transport schedules, geo-located tweets). This limitation makes mechanisms too restrictive, gives unrealistic outputs, and ultimately leads to low practical utility. To address these concerns, we propose a local differentially private mechanism that is based on perturbing hierarchically-structured, overlapping n-grams (i.e., contiguous subsequences of length n) of trajectory data. Our mechanism uses a multi-dimensional hierarchy over publicly available external knowledge of real-world places of interest to improve the realism and utility of the perturbed, shared trajectories. Importantly, including real-world public data does not negatively affect privacy or efficiency. Our experiments, using real-world data and a range of queries, each with real-world application analogues, demonstrate the superiority of our approach over a range of competing methods.

Budget Sharing for Multi-Analyst Differential Privacy [Download Paper] David A Pujol (Duke University), Yikai Wu (Duke University), Brandon T Fain (Duke University), Ashwin Machanavajjhala (Duke) Large organizations that collect data about populations (like the US Census Bureau) release summary statistics that are used by multiple stakeholders for resource allocation and policy making problems. These organizations are also legally required to protect the privacy of individuals from whom they collect data. Differential Privacy (DP) provides a solution to release useful summary data while preserving privacy. However, most DP mechanisms are designed to answer a single set of queries and optimize the total accuracy. In reality, there are often multiple stakeholders that use a given data release and have overlapping but not-identical queries. This introduces a novel joint optimization problem in DP where the privacy budget must be shared among different analysts. In this work, we initiate study into the problem of DP query answering across multiple analysts. To capture the competing goals and priorities of multiple analysts, we formulate three desiderata that any mechanism should satisfy in this setting -- The Sharing Incentive, Non-Interference, and Workload Adaptivity -- while still optimizing for overall error. We demonstrate how existing DP query answering mechanisms in the multi-analyst settings fail to satisfy at least one of the desiderata. We present novel DP algorithms that provably satisfy all our desiderata and empirically show that they incur low error on realistic tasks.

Improving Utility and Security of the Shuffler-based Differential Privacy [Download Paper] Tianhao Wang (Purdue University), Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group), Min Xu (University of Chicago), Zhicong Huang (Alibaba Group), Cheng Hong (Alibaba Group), Jingren Zhou (Alibaba Group), Ninghui Li (Purdue University), Somesh Jha (University of Wisconsin-Madison and XaiPient) When collecting information, local differential privacy (LDP) alle-viates privacy concerns of users because their private information is randomized before being sent it to the central aggregator. LDP imposes large amount of noise as each user executes the randomiza-tion independently. To address this issue, recent work introduced an intermediate server with the assumption that this intermediate server does not collude with the aggregator. Under this assump-tion, less noise can be added to achieve the same privacy guarantee as LDP, thus improving utility for the data collection task. This paper investigates this multiple-party setting of LDP. We analyze the system model and identify potential adversaries. We then make two improvements: a new algorithm that achieves a bet-ter privacy-utility tradeoff; and a novel protocol that provides better protection against various attacks. Finally, we perform experiments to compare different methods and demonstrate the benefits of using our proposed method.

Local Dampening: Differential Privacy for Non-numeric Queries via Local Sensitivity [Download Paper] Victor Farias (Universidade Federal do Ceara), Felipe Brito (LSBD/UFC), Cheryl Flynn (AT&T Labs Research), Javam C Machado (LSBD/UFC), Subhabrata Majumdar (AT&T Labs Research), Divesh Srivastava (AT&T Labs Research) Differential privacy is the state-of-the-art formal definition for data release under strong privacy guarantees. A variety of mechanisms have been proposed in the literature for releasing the noisy output of numeric queries (e.g., using the Laplace mechanism), based on the notions of global sensitivity and local sensitivity. However, although there has been some work on generic mechanisms for releasing the output of non-numeric queries using global sensitivity (e.g., the Exponential mechanism), the literature lacks generic mechanisms for releasing the output of non-numeric queries using local sensitivity to reduce the noise in the query output. In this work, we remedy this shortcoming and present the local dampening mechanism. We adapt the notion of local sensitivity for the non-numeric setting and leverage it to design a generic nonnumeric mechanism. We illustrate the effectiveness of the local dampening mechanism by applying it to two diverse problems: (i) Influential node analysis. Given an influence metric, we release the top-k most influential nodes while preserving the privacy of the relationship between nodes in the network; (ii) Decision tree induction. We provide a private adaptation to the ID3 algorithm to build decision trees from a given tabular dataset. Experimental results show that we could reduce the use of privacy budget by 3 to 4 orders of magnitude for Influential node analysis and increase accuracy up to 12% for Decision tree induction when compared to global sensitivity based approaches.

Frequency Estimation under Local Differential Privacy [Download Paper] Graham Cormode (University of Warwick), Sam Maddock (University of Warwick), Carsten Maple (University of Warwick) Private collection of statistics from a large distributed population is an important problem, and has led to large scale deployments from several leading technology companies. The dominant approach requires each user to randomly perturb their input, leading to guarantees in the local differential privacy model. In this paper, we place the various approaches that have been suggested into a common framework, and perform an extensive series of experiments to understand the tradeoffs between different implementation choices. Our conclusion is that for the core problems of frequency estimation and heavy hitter identification, careful choice of algorithms can lead to very effective solutions that scale to millions of users.

13:30 – 15:00 CESTResearch Session 49: Distributed Machine Learning Chaired by Matthias Boehm

Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches [Download Paper] Yuhao Zhang (University of California, San Diego), Frank McQuillan (VMware), Nandish Jayaram (Intuit), Nikhil Kak (VMware), Ekta Khanna (VMware), Orhan Kislal (VMware), Domino Valdano (VMware), Arun Kumar (University of California, San Diego) Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems, but to bring MOP-based DL to DB-resident data, we show that there is no single ``best'' approach, and an interesting tradeoff space of approaches exists. We explain four canonical approaches and build prototypes upon Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance) and compare them empirically with large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and there is a Pareto frontier; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.

Distributed Numerical and Machine Learning Computations via Two-Phase Execution of Aggregated Join Trees [Download Paper] Dimitrije Jankov (Rice University), Binhang Yuan (Rice University), Shangyu Luo (Rice University), Chris Jermaine (Rice University) When numerical and machine learning (ML) computations are expressed relationally, classical query execution strategies (hash-based joins and aggregations) can do a poor job distributing the computation. In this paper, we propose a two-phase execution strategy for numerical computations that are expressed relationally, as aggregated join trees (that is, expressed as a series of relational joins followed by an aggregation). In a pilot run, lineage information is collected; this lineage is used to optimally plan the computation at the level of individual records. Then, the computation is actually executed. We show experimentally that a relational system making use of this two-phase strategy can be an excellent platform for distributed ML computations.

Tensor Relational Algebra for Distributed Machine Learning System Design [Download Paper] Binhang Yuan (Rice University), Dimitrije Jankov (Rice University), Jia Zou (Arizona State University), Yuxin Tang (Rice University), Daniel Bourgeois (Rice University), Chris Jermaine (Rice University) We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or tensors that do not easily fit into the RAM of an ASIC. In this paper, we present an alternative implementation abstraction called the tensor relational algebra (TRA). The TRA is a set-based algebra based on the relational algebra. Expressions in the TRA operate over binary tensor relations, where keys are multi-dimensional arrays and values are tensors. The TRA is easily executed with high efficiency in a parallel or distributed environment, and amenable to automatic optimization. Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML workflows in distributed clusters.

ParaX: Boosting Deep Learning for Big Data Analytics on Many-Core CPUs [Download Paper] Lujia Yin (NUDT), Yiming Zhang (NUDT), Zhaoning Zhang (NUDT), Yuxing Peng (NUDT), Peng Zhao (Intel) Despite the fact that GPUs and accelerators are more efficient in deep learning (DL), commercial clouds now heavily use CPUs in DL computation because there are large numbers of CPUs which would otherwise sit idle during off-peak periods. Following the trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current platforms cannot scale well to a large number of CPU cores, making many-core CPUs inefficient in DL computation. We analyze the memory access patterns of various layers and identify the root cause of the low scalability, i.e., the per-layer barriers that are implicitly imposed by current plat-forms which assign one single instance (i.e., one batch of input data) to a CPU. The barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers. This paper presents a novel approach called ParaX, which boosts the performance of DL on many-core CPUs by effectively alleviating bandwidth contention and CPU starvation. Our key idea is to assign one instance to each CPU core instead of to the entire CPU, so as to remove the per-layer barriers on the executions of the many cores. ParaX designs an ultra-light scheduling policy which sufficiently overlaps the access-intensive layers with the compute-intensive ones to avoid contention, and proposes a NUMA-aware gradient server mechanism for training which leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. Extensive evaluation shows that ParaX achieves significant acceleration for DL on many-core CPUs.

Declarative Data Serving: The Future of Machine Learning Inference on the Edge [Download Paper] Ted Shaowang (University of Chicago), Nilesh Jain (Intel), Dennis Matthews (Intel), Sanjay Krishnan (U Chicago) Recent advances in computer architecture and networking have ushered in a new age of edge computing, where computation is placed close to the point of data collection to facilitate low-latency decision making. As the complexity of such deployments grow into networks of interconnected edge devices, getting the necessary data to be in ``the right place at the right time'' can become a challenge. We envision a future of edge analytics where data flows between edge nodes are declaratively configured through high-level constraints. Using machine learning model-serving as a prototypical task, we illustrate how the heterogeneity and specialization of edge devices can lead to complex, task-specific communication patterns even in relatively simple situations. Without a declarative framework, managing this complexity will be challenging for developers and will lead to brittle systems. We conclude with a research vision for database community that brings our perspective to the emergent area of edge computing.

13:30 – 15:00 CESTResearch Session 50: Distributed Analytics Chaired by Haralampos Gavriilidis

Parallel Discrepancy Detection and Incremental Detection [Download Paper] Wenfei Fan (Univ. of Edinburgh), Chao Tian (Alibaba Group), Yanghao Wang (University of Edinburgh), Qiang Yin (Alibaba Group) This paper studies how to catch duplicates, mismatches and conflicts in the same process. We adopt a class of entity enhancing rules that embed machine learning predicates, unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We detect discrepancies as violations of such rules. We establish the complexity of discrepancy detection and incremental detection problems with the rules; they are both \NP-complete and \W[1]-hard. To cope with the intractability and scale with large datasets, we develop parallel algorithms and parallel incremental algorithms for discrepancy detection. We show that both algorithms are parallelly scalable, i.e., they guarantee to reduce runtime when more processors are used. Moreover, the parallel incremental algorithm is relatively bounded. The complexity bounds and algorithms carry over to denial constraints, a special case of the entity enhancing rules. Using real-life and synthetic datasets, we experimentally verify the effectiveness, scalability and efficiency of the algorithms.

Space- and Computationally-Efficient Set Reconciliation via Parity Bitmap Sketch (PBS) [Download Paper] Long Gong (Facebook), Ziheng Liu (Peking University), Liang Liu (Georgia Institute of Technology), Jun Xu (Georgia Tech), Mitsunori Ogihara (University of Miami), Tong Yang (Peking University) Set reconciliation is a fundamental algorithmic problem that arises in many networking, system, and database applications. In this problem, two large sets A and B of objects (bitcoins, files, records, etc.) are stored respectively at two different network-connected hosts, which we name Alice and Bob respectively. Alice and Bob communicate with each other to learn $A\Delta B$, the difference between A and B, and as a result the reconciled set $A\bigcup B$. Current set reconciliation schemes are based on either Invertible Bloom Filters (IBF) or Error-Correction Codes (ECC). The former has a low computational complexity of O(d), where d is the cardinality of $A\Delta B$, but has a high communication overhead that is several times larger than the theoretical minimum. The latter has a low communication overhead close to the theoretical minimum, but has a much higher computational complexity of $O(d^2)$. In this work, we propose Parity Bitmap Sketch (PBS), an ECC- based set reconciliation scheme that gets the better of both worlds: PBS has both a low computational complexity of O(d) just like IBF-based solutions and a low communication overhead of roughly twice the theoretical minimum. A separate contribution of this work is a novel rigorous analytical framework that can be used for the precise calculation of various performance metrics and for the near-optimal parameter tuning of PBS.

Elle: Inferring Isolation Anomalies from Experimental Observations [Download Paper] Peter Alvaro (UC Santa Cruz), Kyle Kingsbury (Jepsen) Users who care about their data store it in databases, which (at least in principle) guarantee some form of transactional isolation. However, experience shows that many databases do not provide the isolation guarantees they claim. With the recent proliferation of new distributed databases, demand has grown for checkers that can, by generating client workloads and injecting faults, produce anomalies that wit- ness a violation of a stated guarantee. An ideal checker would be sound (no false positives), efficient (polynomial in history length and concurrency), effective (finding viola- tions in real databases), general (analyzing many patterns of transactions), and informative (justifying the presence of an anomaly with understandable counterexamples). Sadly, we are aware of no checkers that satisfy these goals. We present Elle: a novel checker which infers an Adya- style dependency graph between client-observed transactions. It does so by carefully selecting database objects and oper- ations when generating histories, so as to ensure that the results of database reads reveal information about their ver- sion history. Elle can detect every anomaly in Adya et al’s formalism (except for predicates), discriminate between them, and provide concise explanations of each. This paper makes the following contributions: we present Elle, demonstrate its soundness over specific datatypes, measure its efficiency against the current state of the art, and give evidence of its effectiveness via a case study of four real databases.

Language-Agnostic Integrated Queries in a Managed Polyglot Runtime [Download Paper] Filippo Schiavio (Università della Svizzera italiana (USI)), Daniele Bonetta (Oracle Labs), Walter Binder (Università della Svizzera italiana (USI)) Language-integrated query (LINQ) frameworks offer a convenient programming abstraction for processing in-memory collections of data, allowing developers to concisely express declarative queries using general-purpose programming languages. Existing LINQ frameworks rely on the well-defined type system of statically-typed languages such as C♯ or Java to perform query compilation and execution. As a consequence of this design, they do not support dynamic languages such as Python, R, or JavaScript. Such languages are however very popular among data scientists, who would certainly benefit from LINQ frameworks in data analytics applications. In this work we bridge the gap between dynamic languages and LINQ frameworks. We introduce DynQ, a novel query engine designed for dynamic languages. DynQ is language-agnostic, since it is able to execute SQL queries in a polyglot language runtime. Moreover, DynQ can execute queries combining data from multiple sources, namely in-memory object collections as well as on-file data and external database systems. Our evaluation of DynQ shows performance comparable with equivalent hand-optimized code, and in line with common data-processing libraries and embedded databases, making DynQ an appealing query engine for standalone analytics applications and for data-intensive server-side workloads.

Lachesis: Automatic Partitioning for UDF-Centric Analytics [Download Paper] Jia Zou (Arizona State University), Amitabh Das (Arizona State University), Pratik Barhate (Arizona State University), Arun Iyengar (IBM T.J. Watson Research Center), Binhang Yuan (Rice University), Dimitrije Jankov (Rice University), Chris Jermaine (Rice University) Partitioning is effective in avoiding expensive shuffling operations. However, it remains a significant challenge to automate this process for Big Data analytics workloads that extensively use user defined functions (UDFs), where sub-computations are hard to be reused for partitionings compared to relational applications. In addition, functional dependency that is widely utilized for partitioning selection is often unavailable in the unstructured data that is ubiquitous in UDF-centric analytics. We propose the Lachesis system, which represents UDF-centric workloads as workflows of analyzable and reusable sub-computations. Lachesis further adopts a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then applied to automatically optimize the storage of the data across applications to improve the performance and users' productivity.

13:30 – 15:00 CESTResearch Session 51: Query Execution Chaired by Steffen Zeuch

Beyond Equi-joins: Ranking, Enumeration and Factorization [Download Paper] Nikolaos Tziavelis (Northeastern University), Wolfgang Gatterbauer (Northeastern University), Mirek Riedewald (Northeastern University) We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time. This is within a polylogarithmic factor of the best known complexity for equi-joins and even of O(n+k), the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections, such as the so-called free-connex queries. Remarkably, they hold even when the entire output is of size n^ℓ for a join of ℓ relations. The key ingredient is a novel O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.

A Practical Approach to Groupjoin and Nested Aggregates [Download Paper] Philipp Fent (TUM), Thomas Neumann (TUM) Groupjoins, the combined execution of a join and a subsequent group by, are common in analytical queries, and occur in about 1/8 of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention, which limits their usefulness in a many-core system. Having an efficient implementation of groupjoins is highly desirable, as groupjoins are not only used to fuse group by and join but are also introduced by the unnesting component of the query optimizer to avoid nested-loops evaluation of aggregates. Furthermore, the query optimizer needs be able to reason over the result of aggregation in order to schedule it correctly. Traditional selectivity and cardinality estimations quickly reach their limits when faced with computed columns from nested aggregates, which leads to poor cost estimations and thus, suboptimal query plans. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose two novel techniques, aggregate estimates to predict the result distribution of aggregates, and parallel groupjoin execution for a scalable execution of groupjoins. The resulting system has significantly better estimates and a contention-free evaluation of groupjoins, which can speed up some TPC-H queries up to a factor of 2.

Stacked Filters: Learning to Filter by Structure [Download Paper] Kyle B Deeds (Harvard University), Brian N Hentschel (Harvard University), Stratos Idreos (Harvard) We present Stacked Filters, a new probabilistic filter which is fast and robust similar to query-agnostic filters (such as Bloom and Cuckoo filters), and at the same time brings low false positive rates and sizes similar to classifier-based filters (such as Learned Filters). The core idea is that Stacked Filters incorporate workload knowledge about frequently queried non-existing values. Instead of learning, they structurally incorporate that knowledge using hashing and several sequenced filter layers, indexing both data and frequent negatives. Stacked Filters can also gather workload knowledge on-the-fly and adaptively build the filter. We show experimentally that for a given memory budget, Stacked Filters achieve end-to-end query throughput up to 130x better than the best alternative for a workload, either query-agnostic or classifier-based filters, and depending on where data is (SSD or HDD).

Charting the Design Space of Query Execution using VOILA [Download Paper] Tim Gubner (CWI), Peter Boncz (CWI) Database architecture, while having been studied for four decades now, has delivered only a few designs with well-understood properties. These few are followed by most actual systems. Acquiring more knowledge about the design space is a very time-consuming processes that requires manually crafting prototypes with a low chance of generating material insight. We propose a framework that aims to accelerate this exploration process significantly. Our framework enables synthesizing many different engines from a description in a carefully designed domain-specific language (VOILA). We explain basic concepts and formally define the semantics of VOILA. We demonstrate VOILA's flexibility by presenting translation back-ends that allow the synthesis of state-of-the-art paradigms (data-centric compilation, vectorized execution, AVX-512), mutations and mixes thereof. We show-case VOILA's flexibility by exploring the query engine design space in an automated fashion. We generated thousands of query engines and report our findings. Queries generated by VOILA achieve similar performance as state-of-the-art hand-optimized implementations and are up to 35.5x faster than well-known systems.

13:30 – 15:00 CESTResearch Session 52: Graph Processing II Chaired by Riccardo Tommasini

Columnar Storage and List-based Processing for Graph Database Management Systems [Download Paper] Pranjal Gupta (University of Waterloo), Amine Mhedhbi (University of Waterloo), Semih Salihoglu (University of Waterloo) We revisit column-oriented storage and query processing techniques in the context of contemporary graph database management systems (GDBMSs). Similar to column-oriented RDBMSs, GDBMSs support read-heavy analytical work-loads that however have fundamentally different data access patterns than traditional analytical workloads. We first derive a set of desiderata for optimizing storage and query processors of GDBMS based on their access patterns. We then present the design of columnar storage, compression, and query processing techniques based on these desiderata. In addition to showing direct integration of existing techniques from columnar RDBMSs, we also propose novel ones that are optimized for GDBMSs. These include a novel list-based query processor, which avoids expensive data copies of traditional block-based processors under many-to-many joins and avoids materializing adjacency lists in intermediate tuples, a new data structure we call single-indexed edge property pages and an accompanying edge ID scheme, and a new application of Jacobson’s bit vector index for compressing NULL values and empty lists. We integrated our techniques into the GraphflowDB in-memory GDBMS. Through extensive experiments, we demonstrate the scalability and query performance benefits of our techniques.

On Querying Historical K-Cores [Download Paper] Michael R Yu (UNSW), Dong Wen (University of Technology Sydney), Lu Qin (UTS), Ying Zhang (University of Technology Sydney), Wenjie Zhang (University of New South Wales), Xuemin Lin (University of New South Wales) Many real-world relationships between entities can be modeled as temporal graphs, where each edge is associated with a timestamp or a time interval representing its occurrence. K-core is a fundamental model used to capture cohesive subgraphs in a simple graph and have drawn much research attention over the last decade. Despite widespread research, none of the existing works support the efficient querying of historical k-cores in temporal graphs. In this paper, given an integer k and a time window, we study the problem of computing all k-cores in the graph snapshot over the time window. We propose an index-based solution and several pruning strategies to reduce the index size. We also design a novel algorithm to construct this index, whose running time is linear to the final index size. Lastly, we conducted extensive experiments on several real-world temporal graphs to show the high effectiveness of our index-based solution.

An Experimental Evaluation and Guideline for Path Finding in Weighted Dynamic Network [Download Paper] Mengxuan Zhang (The University of Queensland), Lei Li (University of Queensland), Xiaofang Zhou (The Hong Kong University of Science and Technology) The shortest path computation is a building block of various network applications. Because the real-life networks are evolving as time passes by, Dynamic Shortest Path (DSP) problem has drawn lots of research attention in recent years. However, as DSP has many factors related to network topology, update pattern, and query characteristics, while the existing works only test their algorithms on very limited situations and comparisons, it is still hard to choose the most suitable method during implementation. To this end, we first identify the determinant dimensions and constraint dimensions of DSP problem and create a complete problem space to cover all possible situations. Then we evaluate the state-of-the-art DSP methods under the same implementation standard and test them systematically under a set of synthetic dynamic networks. Furthermore, we propose the dynamic degree to classify the dynamic environments and use throughput evaluate their dynamic performance. These results serve as the guidelines to find the best solution for each situation during system implementation, and also identify the research opportunities. Finally, we validate our findings in the real-life dynamic networks.

Materializing Knowledge Bases via Trigger Graphs [Download Paper] Efthymia Tsamoura (Samsung AI Research), David Carral (LIRMM, Inria, University of Montpellier, CNRS), Enrico Malizia (University of Bologna), Jacopo Urbani (Vrije Universiteit Amsterdam) The chase is a well-established family of algorithms used to materialize Knowledge Bases (KBs) for tasks like query answering under dependencies or data cleaning. A general problem of chase algorithms is that they might perform redundant computations. To counter this problem, we introduce the notion of Trigger Graphs (TGs), which guide the execution of the rules avoiding redundant computations. We present the results of an extensive theoretical and empirical study that seeks to answer when and how TGs can be computed and what are the benefits of TGs when applied over real-world KBs. Our results include introducing algorithms that compute (minimal) TGs. We implemented our approach in a new engine, called GLog, and our experiments show that it can be significantly more efficient than the chase enabling us to materialize Knowledge Graphs with 17B facts in less than 40 min using a single machine with commodity hardware.

Estimating Spread of Contact-Based Contagions in a Population Through Sub-Sampling [Download Paper] Sepanta Zeighami (University of Southern California), Cyrus Shahabi (Computer Science Department. University of Southern California), John Krumm (Microsoft Research) Various phenomena such as viruses, gossips, and physical objects (e.g., packages and marketing pamphlets) can be spread through physical contacts. The spread depends on how people move, i.e., their mobility patterns. In practice, mobility patterns of an entire population is never available, and we usually have access to location data of a subset of individuals. In this paper, we formalize and study the problem of estimating the spread of a phenomena in a population, given that we only have access to sub-samples of location visits of some individuals in the population. We show that simple solutions that estimate the spread in the sub-sample and scale it to the population, or more sophisticated solutions that rely on modeling location visits of individuals do not perform well in practice. Instead, we directly model the co-locations between the individuals. We introduce PollSpreader and PollSusceptible, two novel approaches that model the co-locations between individuals using a contact network, and infer the properties of the contact network using the sub-sample to estimate the spread of the phenomena in the entire population. We analytically show that our estimates provide an upper bound and a lower bound on the spread of the disease in expectation. Finally, using a large high-resolution real-world mobility dataset, we experimentally show that our estimates are accurate in practice, while other methods that do not correctly account for co-locations between individuals result in entirely wrong observations (e.g, premature prediction of herd-immunity).