VLDB 2022: Paper Sessions
06 Sep
SNARF: A Learning-Enhanced Range Filter [Download Paper] Kapil Vaidya (MIT)*, Tim Kraska (MIT), Subarna Chatterjee (Harvard University), Eric R Knorr (Harvard), Michael Mitzenmacher (Harvard), Stratos Idreos (Harvard) We present Sparse Numerical Array-Based Range Filters (SNARF), a learned range filter that efficiently supports range queries for numerical data. SNARF creates a model of the data distribution to map the keys into a bit array which is stored in a compressed form. The model, together with the compressed bit array, constitutes SNARF and is used to answer membership queries. We evaluate SNARF on multiple synthetic and real-world datasets as a stand-alone filter and by integrating it into RocksDB. For range queries, SNARF provides up to 50x better false positive rate than state-of-the-art range filters, such as SuRF and Rosetta, with the same space usage. We also evaluate SNARF in RocksDB as a filter replacement for filtering requests before they access on-disk data structures. For RocksDB, SNARF can improve the execution time of the system by up to 10x compared to SuRF and Rosetta for certain read-only workloads.
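As a rough illustration of the general learned range-filter idea (not the authors' implementation), the sketch below uses the sorted keys themselves as a crude empirical-CDF model, maps each key to a position in a bit array, and answers a range query by testing whether any bit in the mapped interval is set. The class name `LearnedRangeFilter` and its parameters are hypothetical.

```python
import bisect

class LearnedRangeFilter:
    """Toy sketch of a learned range filter (illustrative only).

    A sorted sample of the keys serves as a crude model of the data
    distribution (an empirical CDF); every inserted key sets one bit in a
    bit array at the position predicted by that model.
    """

    def __init__(self, keys, bits_per_key=4):
        self.model = sorted(keys)                  # "model" = empirical CDF
        self.size = max(1, bits_per_key * len(keys))
        self.bits = bytearray((self.size + 7) // 8)
        for k in keys:
            self._set(self._position(k))

    def _position(self, key):
        # Map a key to [0, size) via its rank in the sorted sample.
        rank = bisect.bisect_left(self.model, key)
        return min(self.size - 1, rank * self.size // max(1, len(self.model)))

    def _set(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def _get(self, i):
        return (self.bits[i // 8] >> (i % 8)) & 1

    def may_contain_range(self, lo, hi):
        # A range query probes every bit in the mapped interval; a clear
        # interval guarantees no inserted key falls in [lo, hi].
        a, b = self._position(lo), self._position(hi)
        return any(self._get(i) for i in range(a, b + 1))

if __name__ == "__main__":
    f = LearnedRangeFilter(keys=[3, 17, 42, 99, 256])
    print(f.may_contain_range(40, 50))    # True (42 was inserted)
    print(f.may_contain_range(500, 600))  # False: definitely empty
```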
ByteHTAP: ByteDance's HTAP System with High Data Freshness and Strong Data Consistency [Industry] [Download Paper] Jianjun Chen (Bytedance)*, Yonghua Ding (Bytedance.com), Ye Liu (Bytedance Inc.), Fangshi Li (Bytedance), Li Zhang (ByteDance), Mingyi Zhang (ByteDance Inc), Kui Wei (ByteDance Inc.), Cao Lixun (ByteDance), Dan Zou (ByteDance), Yang Liu (ByteDance), Lei Zhang (ByteDance), Rui Shi (ByteDance Inc.), Wei Ding (Bytedance), KAI WU (ByteDance), Shangyu Luo (ByteDance), Jason Sun (Bytedance), Yuming Liang (ByteDance Inc.) In recent years, at ByteDance, we have seen more and more business scenarios that require performing complex analysis over freshly imported data, together with transaction support and strong data consistency. In this paper, we describe our journey of building ByteHTAP, an HTAP system with high data freshness and strong data consistency. It adopts a separate-engine and shared-storage architecture. Its modular system design fully utilizes an existing ByteDance OLTP system and an open-source OLAP system. This choice saves us a lot of resources and development time, and allows easy future extensions such as replacing the query processing engine with other alternatives. ByteHTAP can provide high data freshness with less than one second of delay, which enables many new business opportunities for our customers. Customers can also configure different data freshness thresholds based on their business needs. ByteHTAP also provides strong data consistency through global timestamps across its OLTP and OLAP systems, which greatly relieves application developers from handling complex data consistency issues by themselves. In addition, we introduce some important performance optimizations to ByteHTAP, such as pushing computations to the storage layer and using delete bitmaps to efficiently handle deletes. Lastly, we share our lessons and best practices in developing and running ByteHTAP in production.
Query Processing on Tensor Computation Runtimes [Download Paper] Dong He (University of Washington)*, Supun C Nakandala (University of California, San Diego), Dalitso Banda (Microsoft), Rathijit Sen (Microsoft), Karla Saur (Microsoft), Kwanghyun Park (Microsoft), Carlo Curino (Microsoft -- GSL), Jesús Camacho-Rodríguez (Microsoft), Konstantinos Karanasos (Meta), Matteo Interlandi (Microsoft) The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how database management systems can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10× over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9× speedup over CPU baselines.
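To make the tensor-query-processing idea concrete, here is a minimal sketch (not TQP's actual operator algorithms) of a relational filter plus group-by aggregation expressed with PyTorch tensor primitives; the column names are invented.

```python
# Minimal sketch of the idea behind tensor query processing: a relational
# filter + group-by aggregation expressed with PyTorch tensor primitives.
import torch

# A toy column-oriented table stored as tensors.
quantity = torch.tensor([17.0, 36.0, 8.0, 28.0, 24.0])
price    = torch.tensor([100.0, 50.0, 200.0, 75.0, 10.0])
flag     = torch.tensor([0, 1, 0, 1, 1])       # group-by key (e.g., a flag id)

# SELECT flag, SUM(price) FROM t WHERE quantity > 20 GROUP BY flag
mask = quantity > 20                           # selection as a boolean tensor
groups = flag[mask]
vals = price[mask]

# Group-by aggregation via scatter-add over the key domain.
out = torch.zeros(int(flag.max()) + 1).scatter_add_(0, groups, vals)
print(out)   # tensor([  0., 135.]) -> sum(price) per flag among qualifying rows
```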
Orchestrating Data Placement and Query Execution in Heterogeneous CPU-GPU DBMS [Download Paper] Bobbi W Yogatama (University of Wisconsin-Madison)*, Weiwei Gong (Oracle America), Xiangyao Yu (University of Wisconsin-Madison) There has been a growing interest in using GPU to accelerate data analytics due to its massive parallelism and high memory bandwidth. The main constraint of using GPU for data analytics is the limited capacity of GPU memory. Heterogeneous CPU-GPU query execution is a compelling approach to mitigate the limited GPU memory capacity and PCIe bandwidth. However, the design space of heterogeneous CPU-GPU query execution has not been fully explored. We aim to improve the state-of-the-art CPU-GPU data analytics engine by optimizing data placement and heterogeneous query execution. First, we introduce a semantic-aware fine-grained caching policy which takes into account various aspects of the workload such as query semantics, data correlation, and query frequency when determining data placement between CPU and GPU. Second, we introduce a heterogeneous query executor which can fully exploit data in both CPU and GPU and coordinate query execution at a fine granularity. We integrate both solutions in Mordred, our novel hybrid CPU-GPU data analytics engine. Evaluation on the Star Schema Benchmark shows that the semantic-aware caching policy can outperform the best traditional caching policy by up to 3x. Compared to existing GPU DBMSs, Mordred can outperform them by an order of magnitude.
Design Trade-offs for a Robust Dynamic Hybrid Hash Join [Download Paper] Shiva Jahangiri (University of California, Irvine)*, Michael Carey (UC Irvine), Johann-Christoph Freytag (Humboldt-Universität zu Berlin) Hybrid Hash Join (HHJ) has proven to be one of the most efficient and widely-used join algorithms. While HHJ's performance depends largely on accurate statistics and information about the input relations, it may not always be practical or possible for a system to have such information available. HHJ's design depends on many details to perform well. This paper is an experimental and analytical study of the trade-offs in designing a robust and dynamic HHJ operator. We revisit the design and optimization techniques suggested by previous studies through extensive experiments, comparing them with other algorithms designed by us or used in related studies. We explore the impact of the number of partitions on HHJ's performance and propose a new lower bound for the number of partitions. We design and evaluate different partition insertion techniques to maximize memory utilization with the least CPU cost. Additionally, we consider a comprehensive set of algorithms for dynamically selecting a partition to spill and compare the results against previously published studies. We then present and evaluate two alternative growth policies for spilled partitions. These algorithms have been implemented in the context of Apache AsterixDB and evaluated under different scenarios such as variable record sizes, different distributions of join attributes, and different storage types, including HDD, SSD, and Amazon Elastic Block Store (Amazon EBS).
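For readers unfamiliar with the operator under study, the following stripped-down, in-memory sketch shows the basic partition-then-build-and-probe structure of a hash join. The dynamic aspects the paper actually focuses on (spill selection, partition growth, memory adaptivity) are deliberately omitted, and the function names are illustrative.

```python
from collections import defaultdict

def partitioned_hash_join(build_rows, probe_rows, build_key, probe_key, num_partitions=8):
    # Phase 1: partition both inputs on the join key.
    build_parts = [[] for _ in range(num_partitions)]
    probe_parts = [[] for _ in range(num_partitions)]
    for r in build_rows:
        build_parts[hash(build_key(r)) % num_partitions].append(r)
    for r in probe_rows:
        probe_parts[hash(probe_key(r)) % num_partitions].append(r)

    # Phase 2: build a hash table per partition and probe it.
    for bp, pp in zip(build_parts, probe_parts):
        table = defaultdict(list)
        for r in bp:
            table[build_key(r)].append(r)
        for r in pp:
            for match in table.get(probe_key(r), ()):
                yield match, r

if __name__ == "__main__":
    R = [("a", 1), ("b", 2), ("c", 3)]          # build side
    S = [(2, "x"), (3, "y"), (3, "z")]          # probe side
    print(list(partitioned_hash_join(R, S,
                                     build_key=lambda t: t[1],
                                     probe_key=lambda t: t[0])))
    # [(('b', 2), (2, 'x')), (('c', 3), (3, 'y')), (('c', 3), (3, 'z'))]
```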
DBOS: A DBMS-oriented Operating System [Download Paper] Athinagoras Skiadopoulos (Stanford University)*, Qian Li (Stanford University), Peter Kraft (Stanford University), Kostis Kaffes (Stanford University), Daniel Hong (Massachusetts Institute of Technology (MIT) Media Lab), Shana Mathew (Massachusetts Institute of Technology), David Bestor (MIT), Michael Cafarella (MIT CSAIL), Vijay Gadepally (MIT Lincoln Laboratory - USA), Goetz Graefe (Google), Jeremy Kepner (MIT Lincoln Laboratory), Christos Kozyrakis (Stanford University), Tim Kraska (MIT), Michael Stonebraker (MIT), Lalith Suresh (VMware Research), Matei Zaharia (Stanford and Databricks) This paper lays out the rationale for building a completely new operating system (OS) stack. Rather than build on a single node OS together with separate cluster schedulers, distributed filesystems, and network managers, we argue that a distributed transactional DBMS should be the basis for a scalable cluster OS. We show herein that such a database OS (DBOS) can do scheduling, file management, and inter-process communication with competitive performance to existing systems. In addition, significantly better analytics can be provided as well as a dramatic reduction in code complexity through implementing OS services as standard database queries, while implementing low-latency transactions and high availability only once.
SQLite: Past, Present, and Future [Download Paper] [Industry] Kevin P Gaffney (University of Wisconsin-Madison)*, Martin Prammer (University of Wisconsin - Madison), Laurence C Brasfield (SQLite devs), Richard Hipp (SQLite.org), Dan R Kennedy (Sqlite), Jignesh Patel (UW - Madison) In the two decades following its initial release, SQLite has become the most widely deployed database engine in existence. Today, SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. Several factors are likely responsible for its ubiquity, including its in-process design, standalone codebase, extensive test suite, and cross-platform file format. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical processing (OLAP). DuckDB, a database engine nicknamed "the SQLite for analytics", has recently emerged to meet this demand. While DuckDB has shown strong performance on OLAP benchmarks, it is unclear how SQLite compares. Furthermore, we are aware of no work that attempts to identify root causes for SQLite's performance behavior on OLAP workloads. In this paper, we discuss SQLite in the context of this changing workload landscape. We describe how SQLite evolved from its humble beginnings to the full-featured database engine it is today. We evaluate the performance of modern SQLite on three benchmarks, each representing a different flavor of in-process data management, including transactional, analytical, and blob processing. We delve into analytical data processing on SQLite, identifying key bottlenecks and weighing potential solutions. As a result of our optimizations, SQLite is now up to 4.2X faster on SSB. Finally, we discuss the future of SQLite, envisioning how it will evolve to meet new demands and challenges.
New wine in an old bottle: Data-aware hash functions for Bloom filters [Download Paper] Arindam Bhattacharya (IIT DELHI)*, Chathur Gudesa (Indian Institute of Technology Delhi), Amitabha Bagchi (IIT Delhi), Srikanta Bedathur (IIT Delhi) In many applications of Bloom filters, it is possible to exploit the patterns present in the inserted and non-inserted keys to achieve more compression than the standard Bloom filter. A new class of Bloom filters called Learned Bloom filters use machine learning models to exploit these patterns in the data. In practice, these methods and their variants raise many practical issues: the choice of machine learning models, the training paradigm to achieve the desired results, the choice of thresholds, the number of partitions in case multiple partitions are used, and other such design decisions. In this paper, we present a simple partitioned Bloom filter that works as follows: we partition the Bloom filter into segments, each of which uses a simple projection-based hash function computed using the data. We also provide a theoretical analysis that provides a principled way to select the design parameters of our method: number of hash functions and number of bits per partition. We perform empirical evaluations of our methods on various real-world datasets spanning several applications. We show that it can achieve an improvement in false positive rates of up to two orders of magnitude over standard Bloom filters for the same memory usage, and up to 50% better compression (bytes used per key) for the same FPR, and consistently beats variants of learned Bloom filters.
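To make the partitioned structure concrete, here is a plain partitioned Bloom filter skeleton in Python. The paper's contribution, replacing generic hashes with data-aware, projection-based hashes learned from the keys, is not reproduced here; standard salted hashing stands in for it, and the class name and parameters are illustrative.

```python
import hashlib

class PartitionedBloomFilter:
    """Bit array split into segments, one hash function per segment (sketch)."""

    def __init__(self, num_partitions=4, bits_per_partition=1024):
        self.k = num_partitions
        self.m = bits_per_partition
        self.parts = [bytearray(self.m // 8) for _ in range(self.k)]

    def _bit(self, key, i):
        # One independent hash per partition (here: salted SHA-256).
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(h[:8], "little") % self.m

    def add(self, key):
        for i in range(self.k):
            b = self._bit(key, i)
            self.parts[i][b // 8] |= 1 << (b % 8)

    def __contains__(self, key):
        return all(
            (self.parts[i][self._bit(key, i) // 8] >> (self._bit(key, i) % 8)) & 1
            for i in range(self.k)
        )

if __name__ == "__main__":
    bf = PartitionedBloomFilter()
    for k in ["alice", "bob", "carol"]:
        bf.add(k)
    print("bob" in bf)      # True
    print("mallory" in bf)  # False with high probability
```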
Plush: A Write-Optimized Persistent Log-Structured Hash-Table [Download Paper] Lukas Vogel (TUM)*, Alexander Van Renen (Friedrich-Alexander-Universität Erlangen-Nürnberg), Satoshi Imamura (Fujitsu Laboratories Ltd.), Jana Giceva (TU Munich), Thomas Neumann (TUM), Alfons Kemper (TUM) Persistent memory (PMem) promised DRAM-like performance, byte addressability, and the persistency guarantees of conventional block storage. With the release of Intel Optane DCPMM, those expectations were dampened. While its write latency competes with DRAM, its read latency, write endurance, and especially bandwidth fall behind by up to an order of magnitude. Established PMem index structures mostly focus on lookups and cannot leverage PMem's low write latency. For inserts, DRAM-optimized index structures are still an order of magnitude faster than their PMem counterparts despite the similar write latency. We identify the combination of PMem's low write bandwidth and the existing solutions' high media write amplification as the culprit. We present Plush, a write-optimized, hybrid hash table for PMem with support for variable-sized keys and values. It minimizes media write and read amplification while exploiting PMem's unique advantages, namely its low write latency and full bandwidth even for small reads and writes. On a 24-core server with 768 GB of Intel Optane DCPMM, Plush outperforms state-of-the-art PMem-optimized hash tables by up to 2.44 times for inserts while only using a tiny amount of DRAM. It achieves this speedup by reducing write amplification by 80%. For lookups, its throughput is similar to that of established PMem-optimized tree-like index structures.
Doppler: Automated SKU Recommendation in Migrating SQL Workloads to the Cloud [Download Paper] [Industry] Joyce Cahoon (Microsoft), Wenjing Wang (microsoft), Yiwen Zhu (Microsoft)*, Katherine Lin (Microsoft), Sean Liu (Microsoft), Raymond Truong (Microsoft), Neetu Singh (Microsoft), Chengcheng Wan (University of Chicago), Alexandra M Ciortea (Microsoft), Sreraman Narasimhan (Microsoft), Subru Krishnan (Microsoft) Selecting the optimal cloud target to migrate SQL estates from on-premises to the cloud remains a challenge. Current solutions are not only time-consuming and error-prone, requiring significant user input, but also fail to provide appropriate recommendations. We present Doppler, a scalable recommendation engine that provides right-sized Azure SQL Platform-as-a-Service (PaaS) recommendations without requiring access to sensitive customer data and queries. Doppler introduces a novel price-performance methodology that allows customers to get a personalized rank of relevant cloud targets solely based on low-level resource statistics, such as latency and memory usage. Doppler supplements this rank with internal knowledge of Azure customer behavior to help guide new migration customers towards one optimal target. Experimental results over a 9-month period from prospective and existing customers indicate that Doppler can identify optimal targets and adapt to changes in customer workloads. It has also found cost-saving opportunities among over-provisioned cloud customers, without compromising on capacity or other requirements. Doppler has been integrated and released in the Azure Data Migration Assistant v5.5, which receives hundreds of assessment requests daily.
Waffle: In-memory Grid Index for Moving Objects with Reinforcement Learning-based Configuration Tuning System [Download Paper] Dalsu Choi (Korea University), Hyunsik Yoon (Korea University), Hyubjin Lee (Korea University), Yon Dohn Chung (Korea University)* Location-based services for moving objects are an integral part of our daily lives. For example, ride-sharing services, micro-mobility services, navigation and traffic management, delivery services, and autonomous driving are all based on moving objects. The efficient management of such moving objects is therefore getting more and more important. The main challenge is the handling of a large number of location-update queries together with scan queries. To address this challenge, we propose a novel in-memory grid indexing system, Waffle, for moving objects. Waffle divides a geographical space into fixed-sized cells. For efficient query processing, Waffle forms chunks, each of which consists of neighboring cells. Such a Waffle index is defined by several configuration knobs. A knob configuration has a significant impact on the performance of Waffle, and an appropriate configuration may change as objects continuously move. Therefore, we propose an online configuration tuning system, WaffleMaker, that automatically determines not only knob values but also when to change knob values, as a part of Waffle. Using a configuration determined by WaffleMaker, Waffle rebuilds the current index without blocking user queries based on a concurrency control scheme. Through extensive experiments, we show that Waffle performed better than the existing methods, and WaffleMaker automatically tuned configuration knob values.
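The following minimal in-memory grid-index sketch illustrates the basic mechanism the abstract describes: space is cut into fixed-size cells, objects live in the cell covering their current position, and a range scan only visits overlapping cells. Chunking of neighboring cells and WaffleMaker's knob tuning are omitted, and `cell_size` is an illustrative knob, not the paper's parameterization.

```python
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_size=10.0):
        self.cell_size = cell_size
        self.cells = defaultdict(dict)        # (cx, cy) -> {object_id: (x, y)}
        self.location = {}                    # object_id -> (cx, cy)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, oid, x, y):
        # Location update: move the object to its (possibly new) cell.
        old = self.location.get(oid)
        new = self._cell(x, y)
        if old is not None and old != new:
            del self.cells[old][oid]
        self.cells[new][oid] = (x, y)
        self.location[oid] = new

    def range_scan(self, x1, y1, x2, y2):
        cx1, cy1 = self._cell(x1, y1)
        cx2, cy2 = self._cell(x2, y2)
        for cx in range(cx1, cx2 + 1):
            for cy in range(cy1, cy2 + 1):
                for oid, (x, y) in self.cells.get((cx, cy), {}).items():
                    if x1 <= x <= x2 and y1 <= y <= y2:
                        yield oid

if __name__ == "__main__":
    idx = GridIndex(cell_size=10.0)
    idx.update("car-1", 3.0, 4.0)
    idx.update("car-1", 25.0, 14.0)           # the object moved
    print(list(idx.range_scan(20.0, 10.0, 30.0, 20.0)))   # ['car-1']
```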
Spatial and Temporal Constrained Ranked Retrieval over Videos [Download Paper] Yueting Chen (York University), Nick Koudas (University of Toronto), Xiaohui Yu (York University)*, Ziqiang Yu (Yantai University) Recent advances in CV algorithms have improved accuracy and efficiency, making video annotations possible for all objects that appear. In this paper, we utilize the annotated data provided by such algorithms and construct graph representations to capture both object labels and spatial-temporal relationships of objects in videos. We define the problem of Spatial and Temporal Constrained Ranked Retrieval (STAR Retrieval) over videos. Based on the graph representation, we propose a two-phase approach, consisting of the ingestion phase, where we construct and materialize the Graph Index (GI), and the query phase, where we compute the top ranked windows (video clips) according to the window matching score efficiently. We propose two algorithms to perform Spatial Matching (SMA) and Temporal Matching (TM) separately with an early-stopping mechanism. Our experiments demonstrate the effectiveness of the proposed methods, achieving orders of magnitude speedups on queries with high selectivity.
Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles [Download Paper] David Campos (Aalborg University)*, Tung Kieu (Aalborg University), Chenjuan Guo (Aalborg University), Feiteng Huang (Huawei Cloud Database Innovation Lab), Kai Zheng (University of Electronic Science and Technology of China), Bin Yang (Aalborg University), Christian S Jensen (Aalborg University) With the sweeping digitalization of societal, medical, industrial, and scientific processes, sensing technologies are being deployed that produce increasing volumes of time series data, thus fueling a plethora of new or improved applications. In this setting, outlier detection is frequently important, and while solutions based on neural networks exist, they leave room for improvement in terms of both accuracy and efficiency. With the objective of achieving such improvements, we propose a diversity-driven, convolutional ensemble. To improve accuracy, the ensemble employs multiple basic outlier detection models built on convolutional sequence-to-sequence autoencoders that can capture temporal dependencies in time series. Further, a novel diversity-driven training method maintains diversity among the basic models, with the aim of improving the ensemble's accuracy. To improve efficiency, the approach enables a high degree of parallelism during training. In addition, it is able to transfer some model parameters from one basic model to another, which reduces training time. We report on extensive experiments using real-world multivariate time series that offer insight into the design choices underlying the new approach and offer evidence that it is capable of improved accuracy and efficiency.
Evaluating Query Languages and Systems for High-Energy Physics Data [Download Paper] Dan Graur (ETH Zurich), Ingo Müller (Google)*, Mason Proffitt (University of Washington), Ghislain Fourny (ETH Zürich), Gordon T. Watts (University of Washington), Gustavo Alonso (ETHZ) In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To gain insights on why this is the case, we perform a comprehensive analysis of six diverse, general-purpose data processing platforms using an HEP benchmark. The result of the evaluation is an interesting and rather complex picture of existing solutions: Their query languages vary greatly in how natural and concise HEP query patterns can be expressed. Furthermore, most of them are also between one and two orders of magnitude slower than the domain-specific system used by particle physicists today. These observations suggest that, while database systems and their query languages are in principle viable tools for HEP, significant work remains to make them relevant to HEP researchers.
Incremental Partitioning for Efficient Spatial Data Analytics [Download Paper] Tin Vu (UC Riverside)*, Ahmed Eldawy (University of California, Riverside), Vagelis Hristidis (UC Riverside), Vassilis J. Tsotras (UC Riverside) Big spatial data has become ubiquitous, from mobile applications to satellite data. In most of these applications, data is continuously growing to huge volumes. Existing systems for big spatial data organize records at either the record-level or block-level. Systems that use record-level structures include key-value stores and LSM-Tree stores, which support insert and delete operations and are optimized for highly-selective queries. On the other hand, systems like GeoSpark that use block-level structures (e.g., 128 MB each) are more efficient for analytical queries, but they cannot incrementally maintain the partitioned data and do not support delete operations. This paper proposes a general framework that enables block-level systems to incrementally maintain spatial partitions, in the presence of bulk insertions and deletions, in distributed file system (DFS) blocks. We first formally study the incremental spatial partitioning problem for big data and demonstrate its NP-hardness. Then, we propose a cost model to estimate the performance of queries on the partitioned data and the effect of modifying it as the data grows. After that, we provide three different implementations of the incremental partitioning framework. Comprehensive experiments on large real datasets show that our proposed partitioning algorithms outperform state-of-the-art spatial partitioning methods.
HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework [Download Paper] [Best Scalable Data Science Paper] Xupeng Miao (Peking University)*, Hailin Zhang (Peking University), Yining Shi (Peking University), Xiaonan Nie (Peking University), Zhi Yang (Peking University), Yangyu Tao (Tencent), Bin Cui (Peking University) Embedding models have been recognized as an effective learning paradigm for high-dimensional data. However, one open issue of embedding models lies in their representation (latent factors), which often results in a large parameter space. We observe that existing distributed training frameworks face a scalability issue with embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace skewed popularity distributions of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into HET design, which provides fine-grained consistency guarantees on a per-embedding basis. Compared to previous work that only allows staleness for read operations, HET also utilizes staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% embedding communication reductions and up to 20.68x performance speedup over the state-of-the-art baselines.
Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins [Download Paper] Sahaana Suri (Stanford)*, Ihab F Ilyas (U. of Waterloo), Christopher Re (Stanford University), Theodoros Rekatsinas (University of Wisconsin-Madison) Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is tedious, domain-specific, and lacks support in now-prevalent no-code ML systems that let users create ML pipelines using just input data and high-level configuration files. In response, we propose Ember, a system that abstracts and automates keyless joins to generalize context enrichment. Our key insight is that Ember can enable a general keyless join operator by constructing an index populated with task-specific embeddings. Ember learns these embeddings by leveraging Transformer-based representation learning techniques. We describe our core architectural principles and operators when developing Ember, and empirically demonstrate that Ember allows users to develop no-code context enrichment pipelines for five domains, including search, recommendation and question answering, and can exceed alternatives by up to 39% recall, with as little as a single-line configuration change.
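As a rough sketch of a similarity-based "keyless join" (not Ember's architecture), the snippet below embeds rows from two tables that share no key into vectors and joins them by nearest-neighbor search over those embeddings. A trivial character-trigram hashing embedding stands in for the Transformer-based representations the paper learns, and all names here are invented.

```python
import numpy as np

def embed(text, dim=64):
    # Toy embedding: lowercase character-trigram hashing into a fixed-size vector.
    text = text.lower()
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def keyless_join(left_rows, right_rows, top_k=1):
    L = np.stack([embed(r) for r in left_rows])
    R = np.stack([embed(r) for r in right_rows])
    sims = L @ R.T                      # cosine similarity (rows are unit norm)
    for i, row in enumerate(left_rows):
        for j in np.argsort(-sims[i])[:top_k]:
            yield row, right_rows[j], float(sims[i, j])

if __name__ == "__main__":
    products = ["apple iphone 13 128gb", "samsung galaxy s22"]
    offers = ["Galaxy S22 by Samsung, 256GB", "iPhone 13 (Apple) 128 GB"]
    for left, right, score in keyless_join(products, offers):
        print(left, "<->", right, round(score, 3))
```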
Hippo: Sharing Computations in Hyper-Parameter Optimization [Download Paper] Ahnjae Shin (Seoul National University)*, Joo Seong Jeong (Seoul National University), Do Yoon Kim (Seoul National University), Soyoung Jung (Seoul National University), Byung-gon Chun (Seoul National University) Hyper-parameter optimization is crucial for pushing the accuracy of a deep learning model to its limits. However, a hyper-parameter optimization job, referred to as a study, involves numerous trials of training a model using different training knobs, and therefore is very computation-heavy, typically taking hours and days to finish. We observe that trials issued from hyper-parameter optimization algorithms often share common hyper-parameter sequence prefixes. Based on this observation, we propose Hippo, a hyper-parameter optimization system that reuses computation across trials to reduce the overall amount of computation significantly. Instead of treating each trial independently as in existing hyper-parameter optimization systems, Hippo breaks down the hyper-parameter sequences into stages and merges common stages to form a tree of stages (a stage tree). Hippo maintains an internal data structure, search plan, to manage the current status and history of a study, and employs a critical path based scheduler to minimize the overall study completion time. Hippo applies to not only single studies but multi-study scenarios as well. Evaluations show that Hippo's stage-based execution strategy outperforms trial-based methods for several models and hyper-parameter optimization algorithms, reducing end-to-end training time by up to 2.76× (3.53×) and GPU-hours by up to 4.81× (6.77×), for single (multiple) studies.
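The core prefix-sharing idea can be illustrated with a toy "stage tree": trials are sequences of (knob, value) stages, and trials sharing a prefix share the computation for that prefix. The sketch below only counts how many stage executions are saved; real systems like Hippo additionally schedule and checkpoint the shared stages. The class and field names are hypothetical.

```python
class StageTree:
    def __init__(self):
        self.root = {}          # nested dict: stage -> children
        self.executed = 0       # stages actually run (shared prefixes counted once)
        self.requested = 0      # stages the trials asked for in total

    def add_trial(self, stages):
        node = self.root
        for stage in stages:            # a stage is hashable, e.g. ("lr", 0.1)
            self.requested += 1
            if stage not in node:
                node[stage] = {}
                self.executed += 1      # a new stage: train it once
            node = node[stage]

if __name__ == "__main__":
    tree = StageTree()
    tree.add_trial([("lr", 0.1), ("batch", 32), ("dropout", 0.1)])
    tree.add_trial([("lr", 0.1), ("batch", 32), ("dropout", 0.5)])
    tree.add_trial([("lr", 0.1), ("batch", 64), ("dropout", 0.1)])
    print(tree.executed, "of", tree.requested, "stages executed")   # 6 of 9
```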
Optimizing Inference Serving on Serverless Platforms [Download Paper] Ahsan Ali (Argonne National Lab)*, Riccardo Pinciroli (Gran Sasso Science Institute), Feng Yan (University of Nevada, Reno), Evgenia Smirni (College of William and Mary) Serverless computing is an emerging cloud paradigm that implements a pay-per-use cost model and releases users from the burden of managing virtual resources. This becomes tremendously attractive for machine learning (ML) inference serving as it makes autonomous resource scaling robust and easy to use, especially when workloads are bursty. Existing serverless platforms work well for image-based ML inference serving, where requests are homogeneous in service demands. That said, recent advances in natural language processing could not fully benefit from existing serverless platforms as their requests are intrinsically heterogeneous. Batching requests for processing can significantly increase ML serving efficiency while reducing monetary cost, thanks to the pay-per-use pricing model adopted by serverless platforms. Yet, batching heterogeneous ML requests leads to additional computation overhead as small requests need to be "padded" to the same size as large requests within the same batch. Reaching effective batching decisions (i.e., which requests should be batched together and why) is non-trivial: the padding overhead coupled with the serverless auto-scaling forms a complex optimization problem. To address this, we develop Multi-Buffer Serving (MBS), a framework that optimizes the batching of heterogeneous ML inference serving requests to minimize their monetary cost while meeting their service level objectives (SLOs). The core of MBS is a performance and cost estimator driven by analytical models supercharged by a Bayesian optimizer. MBS is prototyped and evaluated on AWS using bursty workloads. Experimental results show that MBS preserves SLOs while outperforming the state-of-the-art by up to 8x in terms of cost savings while minimizing the padding overhead by up to 37x with 3x fewer serverless function invocations.
Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [Download Paper] Yuxing Han (Alibaba Group), Ziniu Wu (Massachusetts Institute of Technology), Peizhi Wu (University of Pennsylvania), Rong Zhu (Alibaba Group)*, Jingyi Yang (NTU), Liang Wei Tan (Nanyang Technological University), Kai Zeng (Alibaba Group), Gao Cong (Nanyang Technological University), Yanzhao Qin (Alibaba Group), Andreas Pfadler (Alibaba Group), Zhengping Qian (Alibaba Group), Jingren Zhou (Alibaba Group), Jiangneng Li (Alibaba Group), Bin Cui (Peking University) Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans for a query optimizer in DBMS. In the last decade, an increasing number of advanced CardEst methods (especially ML-based) have been proposed with outstanding estimation accuracy and inference latency. However, there exists no study that systematically evaluates the quality of these methods and answers the fundamental question: to what extent can these methods improve the performance of the query optimizer in real-world settings, which is the ultimate goal of a CardEst method. In this paper, we comprehensively and systematically compare the effectiveness of CardEst methods in a real DBMS. We establish a new benchmark for CardEst, which contains a new complex real-world dataset STATS and a diverse query workload STATS-CEB. We integrate the most representative CardEst methods into the open-source DBMS PostgreSQL, and comprehensively evaluate their true effectiveness in improving query plan quality, and other important aspects affecting their applicability. We obtain a number of key findings under different data and query settings. Furthermore, we find that the widely used estimation accuracy metric (Q-Error) cannot distinguish the importance of different sub-plan queries during query optimization and thus cannot truly reflect the generated query plan quality. Therefore, we propose a new metric P-Error to evaluate the performance of CardEst methods, which overcomes the limitation of Q-Error and is able to reflect the overall end-to-end performance of CardEst methods. It could serve as a better optimization objective for future CardEst methods.
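For reference, Q-Error (the metric the paper critiques) is simply the factor by which an estimate deviates from the true cardinality, regardless of direction. Computing the paper's P-Error requires feeding estimates into the optimizer's cost model, which is not reproduced here; this snippet only shows Q-Error.

```python
def q_error(estimate, truth):
    estimate, truth = max(estimate, 1.0), max(truth, 1.0)   # avoid division by zero
    return max(estimate, truth) / min(estimate, truth)

print(q_error(100, 1000))   # 10.0  (10x underestimate)
print(q_error(1000, 100))   # 10.0  (10x overestimate -- same Q-Error)
```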
TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs [Download Paper] [Scalable Data Science] Hongkuan Zhou (University of Southern California)*, Da Zheng (Amazon), Israt Nisa (Amazon), Vassilis N. Ioannidis (Amazon Web Services), Xiang Song (Amazon), George Karypis (Amazon) Many real world graphs contain time domain information. Temporal Graph Neural Networks capture temporal information as well as structural and contextual information in the generated dynamic node embeddings. Researchers have shown that these embeddings achieve state-of-the-art performance in many different tasks. In this work, we propose TGL, a unified framework for large-scale offline Temporal Graph Neural Network training where users can compose various Temporal Graph Neural Networks with simple configuration files. TGL comprises five main components: a temporal sampler, a mailbox, a node memory module, a memory updater, and a message passing engine. We design a Temporal-CSR data structure and a parallel sampler to efficiently sample temporal neighbors to form training mini-batches. We propose a novel random chunk scheduling technique that mitigates the problem of obsolete node memory when training with a large batch size. To address the limitations of current TGNNs only being evaluated on small-scale datasets, we introduce two large-scale real-world datasets with 0.2 and 1.3 billion temporal edges. We evaluate the performance of TGL on four small-scale datasets with a single GPU and the two large datasets with multiple GPUs for both link prediction and node classification tasks. We compare TGL with the open-sourced code of five methods and show that TGL achieves similar or better accuracy with an average of 13× speedup. Our temporal parallel sampler achieves an average of 173× speedup on a multi-core CPU compared with the baselines. On a 4-GPU machine, TGL can train one epoch of more than one billion temporal edges within 1-10 hours. To the best of our knowledge, this is the first work that proposes a general framework for large-scale Temporal Graph Neural Network training on multiple GPUs.
Misinformation Mitigation under Differential Propagation Rates and Temporal Penalties [Download Paper] Michael Simpson (University of British Columbia)*, Laks V.s. Lakshmanan (The University of British Columbia), Farnoosh Hashemi (The University of British Columbia) We propose an information propagation model that captures important temporal aspects that have been well observed in the dynamics of fake news diffusion, in contrast with the diffusion of truth. The model accounts for differential propagation rates of truth and misinformation and for user reaction times. We study a time-sensitive variant of the misinformation mitigation problem, where $k$ seeds are to be selected to activate a truth campaign so as to minimize the number of users that adopt misinformation propagating through a social network. We show that the resulting objective is non-submodular and employ a sandwiching technique by defining submodular upper and lower bounding functions, providing data-dependent guarantees. In order to enable the use of a reverse sampling framework, we introduce a weighted version of reverse reachability sets that captures the associated differential propagation rates and establish a key equivalence between weighted set coverage probabilities and mitigation with respect to the sandwiching functions. Further, we propose an offline reverse sampling framework that provides $(1 - 1/e - \epsilon)$-approximate solutions to our bounding functions and introduce an importance sampling technique to reduce the sample complexity of our solution. Finally, we show how our framework can provide an anytime solution to the problem. Experiments over five datasets show that our approach outperforms previous approaches and is robust to uncertainty in the model parameters.
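For background, here is a compact sketch of the standard (unweighted) reverse-reachability sampling idea the paper builds on: sample a random node, walk the graph backwards keeping each edge with its propagation probability, and greedily pick $k$ seeds that cover the most RR sets. The paper's weighted RR sets, differential rates, and temporal penalties are not modeled here.

```python
import random
from collections import defaultdict

def rr_set(reverse_adj, nodes, p=0.1):
    # Reverse BFS from a random target, keeping each incoming edge with prob. p.
    start = random.choice(nodes)
    seen, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for u in reverse_adj.get(v, ()):        # u -> v in the original graph
            if u not in seen and random.random() < p:
                seen.add(u)
                frontier.append(u)
    return seen

def greedy_seeds(rr_sets, k):
    # Greedy max-coverage over the sampled RR sets.
    covered, seeds = set(), []
    for _ in range(k):
        counts = defaultdict(int)
        for i, s in enumerate(rr_sets):
            if i not in covered:
                for v in s:
                    counts[v] += 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        seeds.append(best)
        covered |= {i for i, s in enumerate(rr_sets) if best in s}
    return seeds

if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    reverse_adj = defaultdict(list)
    for u, v in edges:
        reverse_adj[v].append(u)
    samples = [rr_set(reverse_adj, [1, 2, 3, 4], p=0.5) for _ in range(1000)]
    print(greedy_seeds(samples, k=1))   # node 1 is usually the most influential seed
```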
An In-Depth Study of Continuous Subgraph Matching [Download Paper] Xibo Sun (Hong Kong University of Science and Technology), Shixuan Sun (National University of Singapore)*, Qiong Luo (Hong Kong University of Science and Technology), Bingsheng He (National University of Singapore) Continuous subgraph matching (CSM) algorithms find the occurrences of a given pattern on a stream of data graphs online. A number of incremental CSM algorithms have been proposed. However, a systematic study of these algorithms is missing to identify their advantages and disadvantages on a wide range of workloads. Therefore, we first propose to model CSM as incremental view maintenance (IVM) to capture the design space of existing algorithms. Then, we implement six representative CSM algorithms, including IncIsoMatch, SJ-Tree, Graphflow, IEDyn, TurboFlux, and SymBi, in a common framework based on IVM. We further conduct extensive experiments to evaluate the overall performance of competing algorithms as well as study the effectiveness of individual techniques to pinpoint the key factors leading to the performance differences. We obtain the following new insights into the performance: (1) existing algorithms start the search from an edge in the query graph that maps to an updated data edge, potentially leading to many invalid partial results; (2) all matching orders are based on simple heuristics, which appear ineffective at times; (3) index updates dominate the query time on some queries; and (4) the algorithm with constant delay enumeration bears significant index update cost. Consequently, no algorithm dominates the others in all cases. Therefore, we give a few recommendations based on our experimental results. In particular, the SymBi index is useful for sparse queries or long running queries. The matching orders of IEDyn and TurboFlux work well on tree queries, those of Graphflow on dense queries or when both query and data graphs are sparse, and otherwise, we recommend SymBi's matching orders.
Time-Topology Analysis [Download Paper] Yunkai Lou (Tsinghua University), Chaokun Wang (Tsinghua University)*, Tiankai Gu (Tsinghua University), Hao Feng (Tsinghua University), Jun Chen (Baidu Inc), Jeffrey Xu Yu (Chinese University of Hong Kong) Many real-world networks have been evolving, and are finely modeled as temporal graphs from the viewpoint of graph theory. A temporal graph is informative, and always contains two types of information, i.e., the temporal information and topological information, where the temporal information reflects the time when the relationships are established, and the topological information focuses on the structure of the graph. In this paper, we perform time-topology analysis on temporal graphs to extract useful information. Firstly, a new metric named $\mathbb{T}$-cohesiveness is proposed to evaluate the cohesiveness of a temporal subgraph. It defines the cohesiveness of a temporal subgraph from the time and topology dimensions jointly. Specifically, given a temporal graph $\mathcal{G}_s = (V_s, \mathcal{E}_s)$, cohesiveness in the time dimension reflects whether the connections in $\mathcal{G}_s$ happen in a short period of time, while cohesiveness in the topology dimension indicates whether the vertices in $V_s$ are densely connected and have few connections with vertices out of $\mathcal{G}_s$. Then, $\mathbb{T}$-cohesiveness is utilized to perform time-topology analysis on temporal graphs, and two time-topology analysis methods are proposed. In detail, $\mathbb{T}$-cohesiveness evolution tracking traces the evolution of the $\mathbb{T}$-cohesiveness of a subgraph, and combo searching finds out all the subgraphs that contain the query vertex and have $\mathbb{T}$-cohesiveness larger than a given threshold. Moreover, a pruning strategy is proposed to improve the efficiency of combo searching. Experimental results confirm the efficiency of the proposed time-topology analysis methods and the pruning strategy.
PGE: Robust Product Graph Embedding Learning for Error Detection [Download Paper] Kewei Cheng (UCLA)*, Xian Li (Amazon), Yifan Xu (Amazon.com), Xin Luna Dong (Meta), Yizhou Sun (UCLA) Although product graphs (PGs) have gained increasing attention in recent years for their successful applications in product search and recommendations, the extensive power of PGs can be limited by the inevitable involvement of various kinds of errors. Thus, it is critical to validate the correctness of triples in PGs to improve their reliability. Knowledge graph (KG) embedding methods have strong error detection abilities. Yet, existing KG embedding methods may not be directly applicable to a PG due to its distinct characteristics: (1) PG contains rich textual signals, which necessitates a joint exploration of both text information and graph structure; (2) PG contains a large number of attribute triples, in which attribute values are represented by free texts. Since free texts are too flexible to define entities in KGs, the traditional way of mapping entities to their embeddings using IDs is no longer appropriate for attribute value representation; (3) Noisy triples in a PG mislead the embedding learning and significantly hurt the performance of error detection. To address the aforementioned challenges, we propose an end-to-end noise-tolerant embedding learning framework, PGE, to jointly leverage both text information and graph structure in PG to learn embeddings for error detection.
A Neural Database for Differentially Private Spatial Range Queries [Download Paper] Sepanta Zeighami (University of Southern California)*, Ritesh Ahuja (University of Southern California), Gabriel Ghinita (Univ. of Massachusetts Boston), Cyrus Shahabi (Computer Science Department. University of Southern California) Mobile apps and location-based services generate large amounts of location data. Location density information from such datasets benefits research on traffic optimization, context-aware notifications and public health (e.g., disease spread). To preserve individual privacy, one must sanitize location data, which is commonly done using differential privacy (DP). Existing methods partition the data domain into bins, add noise to each bin and publish a noisy histogram of the data. However, such simplistic modelling choices fall short of accurately capturing the useful density information in spatial datasets and yield poor accuracy. We propose a machine-learning based approach for answering range count queries on location data with DP guarantees. We focus on countering the sources of error that plague existing approaches (i.e., noise and uniformity error) through learning, and we design a neural database system that models spatial data such that density features are preserved, even when DP-compliant noise is added. We also devise a framework for effective system parameter tuning on top of public data, which helps set important system parameters without expending scarce privacy budget. Extensive experimental results on real datasets with heterogeneous characteristics show that our proposed approach significantly outperforms the state of the art.
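To ground the discussion, the snippet below makes the abstract's baseline concrete: partition the spatial domain into a grid, add Laplace noise calibrated to sensitivity 1 and budget epsilon to each bin, and answer range-count queries from the noisy histogram. The paper's learned neural alternative is not reproduced; parameter names are illustrative.

```python
import numpy as np

def dp_histogram(points, bins=16, extent=((0, 1), (0, 1)), epsilon=1.0, seed=0):
    rng = np.random.default_rng(seed)
    hist, xedges, yedges = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins, range=extent
    )
    # Each individual contributes to exactly one bin, so sensitivity is 1.
    noisy = hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)
    return noisy, xedges, yedges

def range_count(noisy, xedges, yedges, x1, x2, y1, y2):
    xi = (xedges[:-1] >= x1) & (xedges[1:] <= x2)   # bins fully inside the range
    yi = (yedges[:-1] >= y1) & (yedges[1:] <= y2)
    return float(noisy[np.ix_(xi, yi)].sum())

if __name__ == "__main__":
    pts = np.random.default_rng(1).random((10_000, 2))
    noisy, xe, ye = dp_histogram(pts, epsilon=0.5)
    print(range_count(noisy, xe, ye, 0.0, 0.5, 0.0, 0.5))  # roughly 2500 +/- noise
```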
Serving Deep Learning Models with Deduplication from Relational Databases [Download Paper] Lixi Zhou (Arizona State University), Jiaqing Chen (Arizona State University), Amitabh Das (Arizona State University), Hong Min (IBM T.J. Watson Research Center), Lei Yu (IBM), Ming Zhao (ASU School of Computing), Jia Zou (Arizona State University)* Serving deep learning models from relational databases brings significant benefits. First, features extracted from databases do not need to be transferred to any decoupled deep learning systems for inferences, and thus the system management overhead can be significantly reduced. Second, in a relational database, data management along the storage hierarchy is fully integrated with query processing, and thus it can continue model serving even if the working set size exceeds the available memory. Applying model deduplication can greatly reduce the storage space, memory footprint, cache misses, and inference latency. However, existing data deduplication techniques are not applicable to the deep learning model serving applications in relational databases. They do not consider the impacts on model inference accuracy as well as the inconsistency between tensor blocks and database pages. This work proposes synergistic storage optimization techniques for duplication detection, page packing, and caching to enhance database systems for model serving. Evaluation results show that our proposed techniques significantly improve storage efficiency and model inference latency, and outperform existing deep learning frameworks in the targeted scenarios.
Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning [Download Paper] Haoteng Yin (Purdue University)*, Muhan Zhang (Peking University), Yanbang Wang (Cornell University), Jianguo Wang (Purdue University), Pan Li (Purdue University) Subgraph-based graph representation learning (SGRL) has been recently proposed to deal with some fundamental challenges encountered by canonical graph neural networks (GNNs), and has demonstrated advantages in many important data science applications such as link, relation and motif prediction. However, current SGRL approaches suffer from scalability issues since they require extracting subgraphs for each training or testing query. Recent solutions that scale up canonical GNNs may not apply to SGRL. Here, we propose a novel framework SUREL for scalable SGRL by co-designing the learning algorithm and its system support. SUREL adopts walk-based decomposition of subgraphs and reuses the walks to form subgraphs, which substantially reduces the redundancy of subgraph extraction and supports parallel computation. Experiments over six homogeneous, heterogeneous and higher-order graphs with millions of nodes and edges demonstrate the effectiveness and scalability of SUREL. In particular, compared to SGRL baselines, SUREL achieves 10X speed-up with comparable or even better prediction performance; while compared to canonical GNNs, SUREL achieves 50% prediction accuracy improvement.
ConnectorX: Accelerating Data Loading From Databases to Dataframes [Download Paper] [Scalable Data Science] Xiaoying Wang (Simon Fraser University), Weiyuan Wu (Simon Fraser University), Jinze Wu (Simon Fraser University), Yizhou Chen (Simon Fraser University), Nick Zrymiak (Simon Fraser University), Changbo Qu (Simon Fraser University), Lampros Flokas (Columbia University), George Chow (Simon Fraser University), Jiannan Wang (Simon Fraser University)*, Tianzheng Wang (Simon Fraser University), Eugene Wu (Columbia University), Qingqing Zhou (Tencent Inc.) Data is often stored in a database management system (DBMS) but dataframe libraries are widely used among data scientists. An important but challenging problem is how to bridge the gap between databases and dataframes. To solve this problem, we present ConnectorX, a client library that enables fast and memory-efficient data loading from various databases (e.g., PostgreSQL, MySQL, SQLite, SQL Server, Oracle) to different dataframes (e.g., Pandas, PyArrow, Modin, Dask, and Polars). We first investigate why the loading process is slow and why it consumes large memory. We surprisingly find that the main overhead comes from the client-side rather than query execution and data transfer. We integrate several existing and new techniques to reduce the overhead and carefully design the system architecture and interface to make ConnectorX easy to extend to various databases and dataframes. Moreover, we propose server-side result partitioning that can be adopted by DBMSs in order to better support exporting data to data science tools. We conduct extensive experiments to evaluate ConnectorX and compare it with popular libraries. The results show that ConnectorX significantly outperforms existing solutions. ConnectorX is open sourced at: https://github.com/sfu-db/connector-x.
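Typical usage of the ConnectorX client library looks like the following; the connection string, table, and column names are placeholders, and the parallel variant splits the query on a numeric column so partitions are fetched concurrently and assembled into one dataframe.

```python
import connectorx as cx

conn = "postgresql://user:password@localhost:5432/mydb"   # placeholder DSN

# Single-threaded load into a Pandas dataframe.
df = cx.read_sql(conn, "SELECT * FROM lineitem")

# Parallel load: partition the query on a numeric column across 4 workers.
df = cx.read_sql(
    conn,
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_num=4,
    return_type="pandas",
)
print(df.shape)
```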
Learning to be a Statistician: Learned Estimator for Number of Distinct Values [Download Paper] Renzhi Wu (Georgia Institute of Technology)*, Bolin Ding ("Data Analytics and Intelligence Lab, Alibaba Group"), Xu Chu (GATECH), Zhewei Wei (Renmin University of China), Xiening Dai (Alibaba Group), Tao Guan (Alibaba Group), Jingren Zhou (Alibaba Group) Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our learned estimator online for reproducibility.
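A deliberately simplified sketch of the training recipe the abstract describes: generate synthetic columns, draw a small random sample from each, featurize the sample by its frequency-of-frequencies profile, and fit a regressor that predicts the (log) true NDV. The paper's actual features, model, and guarantees are richer; all constants and names below are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
SAMPLE, POPULATION, MAX_F = 1_000, 100_000, 20

def featurize(sample):
    # f[i] = number of distinct values seen exactly i times in the sample
    # (frequencies above MAX_F are ignored in this toy version).
    freq = Counter(Counter(sample).values())
    return [freq.get(i, 0) for i in range(1, MAX_F + 1)]

def synthetic_column():
    ndv = int(rng.integers(10, 50_000))
    skew = rng.uniform(0.0, 2.0)
    weights = np.arange(1, ndv + 1) ** -skew
    weights /= weights.sum()
    return rng.choice(ndv, size=POPULATION, p=weights), ndv

X, y = [], []
for _ in range(200):                      # tiny training set, for illustration only
    col, ndv = synthetic_column()
    X.append(featurize(rng.choice(col, size=SAMPLE)))
    y.append(np.log(ndv))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

test_col, true_ndv = synthetic_column()
est = np.exp(model.predict([featurize(rng.choice(test_col, size=SAMPLE))])[0])
print(f"true NDV = {true_ndv}, estimated NDV = {est:.0f}")
```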
Your Read is Our Priority in Flash Storage [Download Paper] Mijin An (Sungkyunkwan University), Soojun Im (Samsung Electronics Co.), Dawoon Jung (Samsung Electronics Co.), Sang Won Lee (Sungkyunkwan University)* When replacing a dirty victim page upon page miss, the conventional buffer managers flush the dirty victim first to the storage before reading the missing page. This read-after-write (RAW) protocol, unfortunately, causes the read stall problem on flash storage; because of the asymmetric I/O speed and parallelism in flash storage, the clean frames are quickly consumed, so the read for the missing page often has to wait for the slow write to complete and for the frame to be clean due to the resource conflict for the same buffer frame. RAW will thus make the performance-critical synchronous reads often blocked by writes, severely worsening transaction throughput and latency. In addition, its strict I/O ordering will make flash storage with abundant parallelism under-utilized. To avoid read stalls in the DBMS buffer, we propose RW (fused read and write) as a new storage interface. Using RW on read stall, the buffer manager can issue both read and write requests at once to the storage. Then, once the dirty page is copied to the storage buffer, it can immediately serve the read. In addition, to resolve read stalls in the flash storage buffer, we propose R-Buf, where the read buffer is separated from the write buffer so that reads can proceed without stalling. RW and R-Buf, working at different layers, complement each other when used together. We prototype RW and R-Buf on a real Cosmos+ OpenSSD board. Evaluation results show that RW alone improves TPC-C throughput over RAW by 3.2x and, combined with R-Buf, by 3.9x. In addition, we demonstrate that R-Buf effectively mitigates the I/O interference in multi-tenancy.
New Query Optimization Techniques in the Spark Engine of Azure Synapse [Download Paper] Abhishek Modi (Microsoft), Kaushik Rajan (Microsoft Research)*, Srinivas Thimmaiah (Microsoft), Prakhar Jain (Databricks), Swinky Mann (Microsoft), Ayushi Agarwal (Microsoft), Ajith Shetty (Microsoft), Shahid K I (Microsoft), Ashit Gosalia (Microsoft), Partho Sarthi (Microsoft Research) The cost of big-data query execution is dominated by stateful operators. These include sort and hash-aggregate that typically materialize intermediate data in memory, and exchange that materializes data to disk and transfers data over the network. In this paper we focus on several query optimization techniques that reduce the cost of these operators. First, we introduce a novel exchange placement algorithm that improves the state-of-the-art and significantly reduces the amount of data exchanged. The algorithm simultaneously minimizes the number of exchanges required and maximizes computation reuse via multi-consumer exchanges. Second, we introduce three partial push-down optimizations that push down partial computation derived from existing operators (group-bys, intersections and joins) below these stateful operators. While these optimizations are generically applicable we find that two of these optimizations (partial aggregate and partial semi-join push-down) are only beneficial in the scale-out setting where exchanges are a bottleneck. We propose novel extensions to existing literature to perform more aggressive partial push-downs than the state-of-the-art and also specialize them to the big-data setting. Finally we propose peephole optimizations that specialize the implementation of stateful operators to their input parameters. All our optimizations are implemented in the Spark engine that powers Azure Synapse. We evaluate their impact on TPC-DS and demonstrate that they make our engine 1.8× faster than Apache Spark 3.0.1.
Cosine: A Cloud-Cost Optimized Self-Designing Key-Value Storage Engine [Download Paper] Subarna Chatterjee (Harvard University )*, Meena Jagadeesan (UC Berkeley), Wilson Qin (Harvard), Stratos Idreos (Harvard) We present a self-designing key-value storage engine, Cosine, which can always take the shape of the close to “perfect” engine architecture given an input workload, a cloud budget, a target performance, and required cloud SLAs. By identifying and formalizing the first principles of storage engine layouts and core key-value algorithms, Cosine constructs a massive design space comprising of sextillion (10^36) possible storage engine designs over a diverse space of hardware and cloud pricing policies for three cloud providers – AWS, GCP, and Azure. Cosine spans across diverse designs such as Log-Structured Merge-trees, B-trees, Log-Structured Hash-tables, in-memory accelerators for filters and indexes as well as trillions of hybrid designs that do not appear in the literature or industry but emerge as valid combinations of the above. Cosine includes a unified distribution-aware I/O model and a learned concurrency-aware CPU model that with high accuracy can calculate the performance and cloud cost of any possible design on any workload and virtual machines. Cosine can then search through that space in a matter of seconds to find the best design and materializes the actual code of the resulting storage engine design using a templated Rust implementation. We demonstrate that on average Cosine outperforms state-of-the-art storage engines such as write-optimized RocksDB, read-optimized WiredTiger, and very write-optimized FASTER by 53x, 25x, and 20x, respectively, for diverse workloads, data sizes, and cloud budgets across all YCSB core workloads and many variants.
Evaluating Persistent Memory Range Indexes: Part Two [Download Paper] [Experiment, Analysis & Benchmark] Yuliang He (Simon Fraser University), Duo Lu (Simon Fraser University), Kaisong Huang (Simon Fraser University), Tianzheng Wang (Simon Fraser University)* Scalable persistent memory (PM) has opened up new opportunities for building indexes that operate and persist data directly on the memory bus, potentially enabling instant recovery, low latency and high throughput. When real PM hardware (Intel Optane Persistent Memory) first became available, previous work evaluated PM indexes proposed in the pre-Optane era. Since then, newer indexes based on real PM have appeared, but it is unclear how they compare to each other and to previous proposals, and what further challenges remain. This paper addresses these issues by analyzing and experimentally evaluating state-of-the-art PM range indexes built for real PM. We find that newer designs inherited past techniques with new improvements, but do not necessarily outperform pre-Optane era proposals. Moreover, PM indexes are often very competitive with or even outperform indexes tailored for DRAM, highlighting the potential of using a unified design for both PM and DRAM. Functionality-wise, these indexes still lack good support for variable-length keys and handling NUMA effect. Based on our findings, we distill new design principles and highlight future directions.
Lotus: Scalable Multi-Partition Transactions on Single-Threaded Partitioned Databases [Download Paper] Xinjing Zhou (Massachusetts Institute of Technology)*, Xiangyao Yu (University of Wisconsin-Madison), Goetz Graefe (Google), Michael Stonebraker (MIT) This paper revisits the H-Store/VoltDB concurrency control scheme for partitioned main-memory databases, which we term run-to-completion-single-thread (RCST), with an eye toward improving its poor performance on multi-partition (MP) workloads. The original scheme focused on maximizing single partition (SP) performance, producing results in the millions of transactions per second on modest clusters, but at the expense of dismal MP performance. In this paper, we show that modest changes to the original RCST algorithms can dramatically improve MP performance with very limited impact on SP performance. That makes RCST superior to popular optimistic and pessimistic schemes without optimizations for batch execution, including OCC and 2PL, on a wide range of multi-node workloads with up to 60% throughput improvement. Our second contribution is to propose a multiplexed-execution-single-thread (MEST) algorithm based on RCST to amortize the network stalls from MP transactions over a batch of MP transactions. This scheme delivers up to 21X higher throughput for SP transactions and comparable MP throughput compared to state-of-the-art distributed deterministic concurrency control algorithms that are optimized for batch execution. Finally, our MEST scheme offers dramatically superior performance when straggler transactions are present in the workload. Our conclusion is that the H-Store/VoltDB concurrency control scheme can be dramatically improved and dominates state-of-the-art algorithms over a variety of MP workloads.
Memory-Optimized Multi-Version Concurrency Control for Disk-Based Database Systems [Download Paper] Michael Freitag (TUM)*, Alfons Kemper (TUM), Thomas Neumann (TUM) Pure in-memory database systems offer outstanding performance but degrade heavily if the working set does not fit into DRAM, which is problematic in view of declining main memory growth rates. In contrast, recently proposed memory-optimized disk-based systems such as Umbra leverage large in-memory buffers for query processing but rely on fast solid-state disks for persistent storage. They offer near in-memory performance while the working set is cached, and scale gracefully to arbitrarily large data sets far beyond main memory capacity. Past research has shown that this architecture is indeed feasible for read-heavy analytical workloads. We continue this line of work in the following paper, and present a novel multi-version concurrency control approach that enables a memory-optimized disk-based system to achieve excellent performance on transactional workloads as well. Our approach exploits that the vast majority of versioning information can be maintained entirely in-memory without ever being persisted to stable storage, which minimizes the overhead of concurrency control. Large write transactions for which this is not possible are extremely rare, and handled transparently by a lightweight fallback mechanism. Our experiments show that the proposed approach achieves transaction throughput up to an order of magnitude higher than competing disk-based systems, confirming its viability in a real-world setting.
A Scalable and Generic Approach to Range Joins [Download Paper] Maximilian Reif (Technical University of Munich)*, Thomas Neumann (TUM) Analytical database systems provide great insights into large datasets and are an excellent tool for data exploration and analysis. A central pillar of query processing is the efficient evaluation of equi-joins, typically with linear-time algorithms (e.g. hash joins). However, for many use-cases with location and temporal data, non-equi joins, like range joins, occur in queries. Without special optimizations, this typically results in nested loop evaluation with quadratic complexity. This leads to unacceptable query execution times. Different mitigations have been proposed in the past, like partitioning or sorting the data. While these allow for handling certain classes of queries, they tend to be restricted in the kind of queries they can support. And, perhaps even more importantly, they do not play nice with additional equality predicates that typically occur within a query and that have to be considered, too. In this work, we present a kd-tree-based, multi-dimension range join that supports a very wide range of queries, and that can exploit additional equality constraints. This approach allows us to handle large classes of queries very efficiently, with negligible memory overhead, and it is suitable as a general-purpose solution for range queries in database systems. The join algorithm is fully parallel, both during the build and the probe phase, and scales to large problem instances and high core counts. We demonstrate the feasibility of this approach by integrating it into our database system Umbra and performing extensive experiments with both large real world data sets and with synthetic benchmarks used for sensitivity analysis. In our experiments, it outperforms hand-tuned Spark code and all other database systems that we have tested.
Turbo-Charging SPJ Query Plans with Learned Physical Join Operator Selections [Download Paper] Axel Hertzschuch (Technische Universität Dresden)*, Claudio Hartmann (Technische Universität Dresden), Dirk Habich (TU Dresden), Wolfgang Lehner (TU Dresden) The optimization of select-project-join (SPJ) queries entails two major challenges: (i) finding a good join order and (ii) selecting the best-fitting physical join operator for each single join within the chosen join order. Previous work mainly focuses on the computation of a good join order, but leaves open to which extent the physical join operator selection accounts for plan quality. Our analysis using different query optimizers indicates that physical join operator selection is crucial and that none of the investigated query optimizers reaches the full potential of optimal operator selections. To unlock this potential, we propose TONIC, a novel cardinality estimation-free extension for generic SPJ query optimizers in this paper. TONIC follows a learning-based approach and revises operator decisions for arbitrary join paths based on learned query feedback. To continuously capture and reuse optimal operator selections, we introduce a lightweight yet powerful Query Execution Plan Synopsis (QEP-S). In comparison to related work, TONIC enables transparent planning decisions with consistent performance improvements. Using two real-life benchmarks, we demonstrate that extending existing optimizers with TONIC substantially reduces query response times with a cumulative speedup of up to 2.8x.
Cost Modelling for Optimal Data Placement in Heterogeneous Main Memory [Download Paper] [Experiments, Analyses & Benchmarks] Robert Lasch (TU Ilmenau, SAP SE)*, Thomas Legler (SAP SE), Norman May (SAP SE), Bernhard Scheirle (SAP SE), Kai-uwe Sattler (TU Ilmenau) The cost of DRAM contributes significantly to the operating costs of in-memory database management systems (IMDBMS). Persistent memory (PMEM) is an alternative type of byte-addressable memory that offers, in addition to persistence, higher capacities than DRAM at a lower price, with the disadvantage of increased latencies and reduced bandwidth. This paper evaluates PMEM as a cheaper alternative to DRAM for storing table base data, which can make up a significant fraction of an IMDBMS’ total memory footprint. Using a prototype implementation in the SAP HANA IMDBMS, we find that placing all table data in PMEM can reduce query performance in analytical benchmarks by more than a factor of two, while transactional workloads are less affected. To quantify the performance impact of placing individual data structures in PMEM, we propose a cost model based on a lightweight workload characterization. Using this model, we show how to place data Pareto-optimally in the heterogeneous memory. Our evaluation demonstrates the accuracy of the model and shows that it is possible to place more than 75% of table data in PMEM while keeping performance within 10% of the DRAM baseline for two analytical benchmarks.
Designing an Open Framework for Query Optimization and Compilation [Download Paper] Michael Jungmair (Technical University of Munich)*, André Kohn (TUM), Jana Giceva (TU Munich) Since its invention, data-centric code generation has been adopted for query compilation by various database systems in academia and industry. These database systems are fast but maximize performance at the expense of developer friendliness, flexibility, and extensibility. Recent advances in the field of compiler construction identified similar issues for domain-specific compilers and introduced a solution with MLIR, a generic infrastructure for domain-specific dialects. We propose a layered query compilation stack based on MLIR with open intermediate representations that can be combined at each layer. We further propose moving query optimization into the query compiler to benefit from existing optimization infrastructure and make cross-domain optimization viable. With LingoDB, we demonstrate that the used approach significantly decreases the implementation effort and is highly flexible and extendible. At the same time, LingoDB achieves high performance and low compilation latencies.
SAFE: A Share-and-Aggregate Bandwidth Exploration Framework for Kernel Density Visualization [Download Paper] Tsz Nam Chan (Hong Kong Baptist University)*, Pak Lon Ip (University of Macau), Leong Hou U (University of Macau), Byron Choi (Hong Kong Baptist University), Jianliang Xu (Hong Kong Baptist University) Kernel density visualization (KDV) has been the de facto method in many spatial analysis tasks, including ecological modeling, crime hotspot detection, traffic accident hotspot detection, and disease outbreak detection. In these tasks, domain experts usually generate multiple KDVs with different bandwidth values. However, generating a single KDV, let alone multiple KDVs, is time-consuming. In this paper, we develop a share-and-aggregate framework, namely SAFE, to reduce the time complexity of generating multiple KDVs given a set of bandwidth values. Domain experts may, however, also specify bandwidth values on the fly. To tackle this issue, we further extend SAFE and develop the exact method SAFE_all and the 2-approximation method SAFE_exp, which reduce the time complexity under this setting. Experimental results on four large-scale datasets (up to 4.33M data points) show that these three methods achieve at least one order of magnitude speedup for generating multiple KDVs in most cases without degrading the visualization quality.
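For context, the cost that SAFE attacks can be seen in the naive baseline below: a minimal Python sketch (illustrative only, not the SAFE algorithm) that recomputes a one-dimensional Gaussian kernel density estimate from scratch for every bandwidth, paying O(|bandwidths| x |grid| x |points|) work that a share-and-aggregate scheme avoids.

import math

def kde_grid(points, grid, bandwidths):
    # Return {bandwidth: [density at each grid location]} using a Gaussian kernel.
    # Naive cost is O(|bandwidths| * |grid| * |points|).
    result = {}
    for h in bandwidths:
        densities = []
        for g in grid:
            s = sum(math.exp(-((g - p) ** 2) / (2 * h * h)) for p in points)
            densities.append(s / (len(points) * h * math.sqrt(2 * math.pi)))
        result[h] = densities
    return result

print(kde_grid(points=[0.0, 1.0, 2.5], grid=[0.0, 1.0, 2.0, 3.0], bandwidths=[0.5, 1.0]))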
SWS: A Complexity-Optimized Solution for Spatial-Temporal Kernel Density Visualization [Download Paper] Tsz Nam Chan (Hong Kong Baptist University)*, Pak Lon Ip (University of Macau), Leong Hou U (University of Macau), Byron Choi (Hong Kong Baptist University), Jianliang Xu (Hong Kong Baptist University) Spatial-temporal kernel density visualization (STKDV) has been extensively used in a wide range of applications, e.g., disease outbreak detection, traffic accident hotspot detection, and crime hotspot detection. While STKDV can provide accurate and comprehensive data visualization, computing STKDV is time-consuming and does not scale to large datasets. To address this issue, we develop a new sliding-window-based solution (SWS), which theoretically reduces the time complexity for generating STKDV without increasing the space complexity. Moreover, we incorporate SWS into a progressive visualization framework, which continuously outputs partial visualization results to the users (from coarse to fine) until the users are satisfied with the visualization. Our experimental studies on five large-scale datasets show that SWS achieves 1.71x to 24x speedups compared with the state-of-the-art methods.
NLC: Search Correlated Window Pairs on Long Time Series [Download Paper] Shuye Pan (Fudan University)*, Peng Wang (Fudan University, China), Chen Wang (Tsinghua University, China), Wei Wang (Fudan University, China), Jianmin Wang (Tsinghua University, China) Nowadays, many applications, such as the Internet of Things and the Industrial Internet, collect data points from sensors continuously to form long time series. Finding correlations between time series is a fundamental task for many time series mining problems. However, most existing works in this area either are limited in the types of relations they detect (e.g., only linear correlations) or do not handle complex temporal relations (e.g., unaligned windows or variable window lengths). In this paper, we propose an efficient approach, Non-Linear Correlation search (NLC), to search for correlated window pairs on two long time series. First, we propose two strategies, window shrinking and window extending, to quickly find high-quality candidates of correlated window pairs. Then, we refine the candidates by a nested one-dimensional search approach. We conduct a systematic empirical study to verify the efficiency and effectiveness of our approach on both synthetic and real-world datasets.
DyHealth: Making Neural Networks Dynamic for Effective Healthcare Analytics [Download Paper] [Industry] Kaiping Zheng (National University of Singapore), Shaofeng Cai (National University of Singapore), Horng-Ruey Chua (National University Hospital), Melanie Herschel (Universität Stuttgart), Meihui Zhang (Beijing Institute of Technology), Beng Chin Ooi (NUS)* In National University Hospital (NUH) in Singapore, we conduct healthcare analytics that analyzes heterogeneous electronic medical records (EMR) to support effective clinical decision-making on a daily basis. Existing work mainly focuses on multimodality for extracting complementary information from different modalities, and/or interpretability for providing interpretable prediction results. However, real-world healthcare analytics has presented another major challenge, i.e., the available modalities evolve or change intermittently. Addressing this challenge requires deployed models to be adaptive to such dynamic modality changes. To meet the aforementioned requirement, we develop a modular, multimodal and interpretable framework DyHealth to enable dynamic healthcare analytics in clinical practice. Specifically, different modalities are processed within their respective data modules that adhere to the interface defined by DyHealth. The extracted information from different modalities is integrated subsequently in our proposed Multimodal Fusion Module in DyHealth. In order to better handle modality changes at runtime, we further propose exponential increasing/decreasing mechanisms to support modality "hot-plug". We also devise a novel modality-based attention mechanism for providing fine-grained interpretation results on a per-input basis. We conduct a pilot evaluation of DyHealth on the patients' EMR data from NUH, in which DyHealth achieves superior performance and therefore, is promising to roll out for hospital-wide deployment. We also validate DyHealth in two public EMR datasets. Experimental results confirm the effectiveness, flexibility, and extensibility of DyHealth in supporting multimodal and interpretable healthcare analytics.
DESIRE: An Efficient Dynamic Cluster-based Forest Indexing for Similarity Search in Multi-Metric Spaces [Download Paper] Yifan Zhu (Zhejiang University), Lu Chen (Zhejiang University), Yunjun Gao (Zhejiang University)*, Baihua Zheng (Singapore Management University), Pengfei Wang (Zhejiang University) Similarity search finds similar objects for a given query object based on a certain similarity metric. Similarity search in metric spaces has attracted increasing attention, as the metric space can accommodate any type of data and support flexible distance metrics. However, a metric space only models a single data type with a specific similarity metric. In contrast, a multi-metric space combines multiple metric spaces to simultaneously model a variety of data types and a collection of associated similarity metrics. Thus, a multi-metric space is capable of performing similarity search over any combination of metric spaces. Many studies focus on indexing a single metric space, while only a few aim at indexing multi-metric spaces to accelerate similarity search. In this paper, we propose DESIRE, an efficient dynamic cluster-based forest index for similarity search in multi-metric spaces. DESIRE first selects high-quality centers to cluster objects into compact regions, and then employs B+-trees to effectively index distances between centers and the corresponding objects. To support dynamic scenarios, efficient update strategies are developed. Further, we provide filtering techniques to accelerate similarity queries in multi-metric spaces. Extensive experiments on four real datasets demonstrate the superior efficiency and scalability of our proposed DESIRE compared with state-of-the-art multi-metric space indexes.
User-Defined Operators: Efficiently Integrating Custom Algorithms into Modern Databases [Download Paper] Moritz Sichert (Technische Universität München)*, Thomas Neumann (TUM) In recent years, complex data mining and machine learning algorithms have become more common in data analytics. Several specialized systems exist to evaluate these algorithms on ever-growing data sets, which are built to efficiently execute different types of complex analytics queries. However, using these various systems comes at a price. Moving data out of traditional database systems is often slow as it requires exporting and importing data, which is typically performed using the relatively inefficient CSV format. Additionally, database systems usually offer strong ACID guarantees, which are lost when adding new, external systems. This disadvantage can be detrimental to the consistency of the results. Most data scientists still prefer not to use classical database systems for data analytics. The main reason why RDBMS are not used is that SQL is difficult to work with due to its declarative and set-oriented nature, and is not easily extensible. We present User-Defined Operators (UDOs) as a concept to include custom algorithms into modern query engines. Users can write idiomatic code in the programming language of their choice, which is then directly integrated into existing database systems. We show that our implementation can compete with specialized tools and existing query engines while retaining all beneficial properties of the database system.
Replicated Layout for In-Memory Database Systems [Download Paper] Sivaprasad Sudhir (MIT)*, Michael Cafarella (MIT CSAIL), Samuel Madden (MIT) Scanning and filtering are the foundations of analytical database systems. Modern DBMSs employ a variety of techniques to partition and layout data to improve the performance of these operations. To accelerate query performance, systems tune data layout to reduce the cost of accessing and processing data. However, these layouts optimize for the average query, and with heterogeneous data access patterns in parts of the data, their performance degrades. To mitigate this, we present CopyRight, a layout-aware partial replication engine that replicates parts of the data differently and lays out each replica differently to maximize the overall query performance. Across a range of real-world query workloads, CopyRight is able to achieve 1.1X to 7.9X faster performance than the best non-replicated layout with 0.25X space overhead. When compared to full table replication with 100% overhead, CopyRight attains the same or up to 5.2X speedup with 25% space overhead.
The next 50 Years in Database Indexing or: The Case for Automatically Generated Index Structures [Download Paper] Jens Dittrich (Saarland University, Saarland Informatics Campus)*, Joris Nix (Saarland University, Saarland Informatics Campus), Christian Schön (Saarland University, Saarland Informatics Campus) Index structures are a building block of query processing and computer science in general. Since the dawn of computer technology there have been index structures, and since then a myriad of index structures have been invented and published each and every year. In this paper we argue that the very idea of “inventing an index” is a misleading concept in the first place. It is the analogue of “inventing a physical query plan”. This paper is a paradigm shift in which we propose to drop the idea of handcrafting index structures (from binary search trees over B-trees to any form of learned index) altogether. We present a new automatic index breeding framework coined Genetic Generic Generation of Index Structures (GENE). It is based on the observation that almost all index structures are assembled along three principal dimensions: (1) structural building blocks, e.g., a B-tree is assembled from two different structural node types (inner and leaf nodes), (2) a couple of invariants, e.g., for a B-tree all paths have the same length, and (3) decisions on the internal layout of nodes (row or column layout, etc.). We propose a generic indexing framework that can mimic many existing index structures along those dimensions. Based on that framework, we propose a generic genetic index generation algorithm that, given a workload and an optimization goal, can automatically assemble and mutate, in other words ‘breed’, new index structure ‘species’. In our experiments we follow multiple goals. We reexamine some good old wisdom from database technology: given a specific workload, will GENE breed an index that is equivalent to what our textbooks and papers currently recommend for such a workload? Or can we do even more? Our initial results strongly indicate that generated indexes are the next step in designing index structures.
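The idea of "breeding" index configurations can be illustrated with a toy genetic loop; the sketch below is conceptual Python, and the configuration space, mutation rule, and cost function are hypothetical placeholders rather than GENE's actual building blocks, invariants, or node layouts.

import math
import random

random.seed(0)

def mutate(cfg):
    # Randomly perturb one dimension of a (toy) index configuration.
    cfg = dict(cfg)
    if random.random() < 0.5:
        cfg["fanout"] = max(2, cfg["fanout"] + random.choice([-16, 16]))
    else:
        cfg["leaf_layout"] = random.choice(["row", "column"])
    return cfg

def cost(cfg, workload):
    # Toy cost model: a lookup visits one node per level and scans keys inside
    # each node; full scans prefer a columnar leaf layout.
    height = math.ceil(math.log(1_000_000, cfg["fanout"]))
    lookup_cost = height * (1 + 0.001 * cfg["fanout"])
    scan_cost = 1000.0 if cfg["leaf_layout"] == "row" else 400.0
    return workload["lookups"] * lookup_cost + workload["scans"] * scan_cost

def breed(workload, generations=50, population=8):
    pop = [{"fanout": 64, "leaf_layout": "row"} for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda c: cost(c, workload))
        survivors = pop[: population // 2]                       # selection
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return min(pop, key=lambda c: cost(c, workload))

print(breed({"lookups": 1000, "scans": 10}))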
A Critical Analysis of Recursive Model Indexes [Download Paper] [Experiment, Analysis & Benchmark] Marcel Maltry (Saarland University)*, Jens Dittrich (Saarland University, Saarland Informatics Campus) The recursive model index (RMI) has recently been introduced as a machine-learned replacement for traditional indexes over sorted data, achieving remarkably fast lookups. Follow-up work focused on explaining RMI's performance and automatically configuring RMIs through enumeration. Unfortunately, configuring RMIs involves setting several hyperparameters, the enumeration of which is often too time-consuming in practice. Therefore, in this work, we conduct the first inventor-independent broad analysis of RMIs with the goal of understanding the impact of each hyperparameter on performance. In particular, we show that in addition to model types and layer size, error bounds and search algorithms must be considered to achieve the best possible performance. Based on our findings, we develop a simple-to-follow guideline for configuring RMIs. We evaluate our guideline by comparing the resulting RMIs with a number of state-of-the-art indexes, both learned and traditional. We show that our simple guideline is sufficient to achieve competitive performance with other learned indexes and RMIs whose configuration was determined using an expensive enumeration procedure. In addition, while carefully reimplementing RMIs, we are able to improve the build time by 2.5x to 6.3x.
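As background for readers unfamiliar with RMIs, the Python sketch below shows the general shape of a two-layer RMI lookup over a sorted array: a root model routes a key to a leaf linear model, which predicts a position that is then corrected by a bounded local search. It is a didactic toy under simplifying assumptions, not the authors' reimplementation or a tuned configuration.

import bisect

class TwoLayerRMI:
    def __init__(self, keys, fanout=4):
        self.keys = keys
        self.fanout = fanout
        # Root "model": map the key range linearly onto [0, fanout) buckets.
        self.k_min, self.k_max = keys[0], keys[-1]
        buckets = [[] for _ in range(fanout)]
        for pos, k in enumerate(keys):
            buckets[self._route(k)].append((k, pos))
        # Leaf models: a per-bucket linear fit through its endpoints plus the
        # maximum prediction error observed for that bucket.
        self.leaves = []
        for b in buckets:
            if not b:
                self.leaves.append((0.0, 0.0, 0))
                continue
            (k0, p0), (k1, p1) = b[0], b[-1]
            slope = (p1 - p0) / (k1 - k0) if k1 != k0 else 0.0
            err = max(abs(p0 + slope * (k - k0) - p) for k, p in b)
            self.leaves.append((slope, p0 - slope * k0, int(err) + 1))

    def _route(self, key):
        frac = (key - self.k_min) / (self.k_max - self.k_min + 1e-12)
        return min(self.fanout - 1, max(0, int(frac * self.fanout)))

    def lookup(self, key):
        slope, intercept, err = self.leaves[self._route(key)]
        guess = int(slope * key + intercept)
        lo = max(0, guess - err)
        hi = min(len(self.keys), guess + err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)  # search inside bounds
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = sorted([3, 7, 11, 19, 23, 31, 42, 57, 64, 71, 88, 93])
rmi = TwoLayerRMI(keys)
assert all(rmi.lookup(k) == i for i, k in enumerate(keys))
print("all lookups correct")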
Magma: A high data density storage engine used in Couchbase [Download Paper] [Industry] Sarath Lakshman (Couchbase)*, Apaar Gupta (Couchbase Inc.), Rohan Suri (Couchbase), Scott D Lashley (Couchbase), John Liang (Couchbase Inc), Srinath Duvuru (Couchbase), Ravi Mayuram (Couchbase) We present Magma, a write-optimized high data density key-value storage engine used in the Couchbase NoSQL distributed document database. Today's write-heavy, data-intensive applications like ad serving, internet-of-things, messaging, and online gaming generate massive amounts of data. As a result, the requirement for storing and retrieving large volumes of data has grown rapidly. Distributed databases that can scale out horizontally by adding more nodes can be used to serve the requirements of these internet-scale applications. To maintain a reasonable cost of ownership, we need to improve storage efficiency in handling large data volumes per node, so that we don't have to rely on adding more nodes. Our current-generation storage engine, Couchstore, is based on a log-structured append-only copy-on-write B+Tree architecture. To make substantial improvements to support higher data density and write throughput, we needed a storage engine architecture that lowers write amplification and avoids compaction operations that periodically rewrite the whole database files. We introduce Magma, a hybrid key-value storage engine that combines LSM Trees and a segmented log approach from log-structured file systems. We present a novel approach to performing garbage collection of stale document versions that avoids index lookups during log segment compaction. This is the key to achieving storage efficiency for Magma and eliminates the need for random I/Os during compaction. Magma offers significantly lower write amplification, scalable incremental compaction, and lower space amplification while not regressing read amplification. Through these efficiency improvements, we improved the single-machine data density supported by Couchbase Server by 3.3x and lowered the memory requirement by 10x, thereby reducing the total cost of ownership by up to 10x. Our evaluation results show that Magma outperforms Couchstore and RocksDB in write-heavy workloads.
Ranked Enumeration of Join Queries with Projections [Download Paper] Shaleen Deep (University of Wisconsin-Madison)*, Xiao Hu (Duke University), Paraschos Koutris (University of Wisconsin-Madison) Join query evaluation with ordering is a fundamental data processing task in relational database management systems. SQL and custom graph query languages such as Cypher offer this functionality by allowing users to specify the order via the ORDER BY clause. In many scenarios, the users also want to see the first k results quickly (expressed by the LIMIT clause), but the value of k is not predetermined, as user queries arrive in an online fashion. Recent work has made considerable progress in identifying optimal algorithms for ranked enumeration of join queries that do not contain any projections. In this paper, we initiate the study of the problem of enumerating results in ranked order for queries with projections. Our main result shows that for any acyclic query, it is possible to obtain a near-linear (in the size of the database) delay algorithm after only a linear-time preprocessing step for two important ranking functions: sum and lexicographic ordering. For a practical subset of acyclic queries known as star queries, we show an even stronger result that allows a user to obtain a smooth tradeoff between faster answering time guarantees and more preprocessing time. Our results are also extensible to queries containing cycles and unions. We also perform a comprehensive experimental evaluation to demonstrate that our algorithms, which are simple to implement, improve the running time by up to three orders of magnitude over state-of-the-art algorithms implemented within open-source RDBMSs and specialized graph databases.
Shortest-Path Queries on Complex Networks: Experiments, Analyses, and Improvement [Download Paper] Junhua Zhang (UTS), Wentao Li (University of Technology Sydney)*, Long Yuan (Nanjing University of Science and Technology), Lu Qin (UTS), Ying Zhang (University of Technology Sydney), Lijun Chang (The University of Sydney) Shortest-path queries, as a basic operation in complex networks, have plentiful applications. One option for answering shortest-path queries is to use online traversal, such as breadth-first search, but this results in excessively long query time. Another option is to extend the existing index-based methods for handling shortest-distance queries to support shortest-path queries. However, the extra space required by the extension causes the total index size to be too large. To achieve an elegant trade-off between query time and index size, we propose a new index-based approach, Monotonic Landmark Labeling (MLL), to process shortest-path queries. MLL works by decomposing the shortest path between two vertices into several subpaths, which are then indexed. At query time, the shortest path can be found efficiently by finding and splicing the subpaths. We verify that the MLL index is small, and we propose a parallel algorithm for creating it efficiently. Extensive experiments show that the MLL index size is bounded by 23 GB on all tested graphs, and the queries can be answered within 2 milliseconds, even on billion-scale graphs.
Reliable Community Search in Dynamic Networks [Download Paper] Yifu Tang (Deakin University), Jianxin Li (Deakin University)*, Nur Al Hasan Haldar (The University of Western Australia), Ziyu Guan (Xidian University), Jiajie Xu (Soochow University), Chengfei Liu (Swinburne University of Technology) Searching for local communities is an important research problem that supports advanced data analysis in various complex networks, such as social networks, collaboration networks, cellular networks, etc. The evolution of such networks over time has motivated several recent studies to identify local communities in dynamic networks. However, these studies only utilize the aggregation of disjoint structural information to measure the quality and ignore the reliability of the communities in a continuous time interval. To fill this research gap, we propose a novel (theta, k)-core reliable community (CRC) model in the weighted dynamic networks, and define the problem of most reliable community search that couples the desirable properties of connection strength, cohesive structure continuity, and the maximal member engagement. To solve this problem, we first develop a novel edge filtering based online CRC search algorithm that can effectively filter out the trivial edge information from the networks while searching for a reliable community. Further, we propose an index structure, Weighted Core Forest-Index (WCF-index), and devise an index-based dynamic programming CRC search algorithm, that can prune a large number of insignificant intermediate results and support efficient query processing. Finally, we conduct extensive experiments systematically to demonstrate the efficiency and effectiveness of our proposed algorithms on eight real datasets under various experimental settings.
FHL-Cube: Multi-Constraint Shortest Path Querying with Flexible Combination of Constraints [Download Paper] Ziyi Liu (The University of Queensland)*, Lei Li (The Hong Kong University of Science and Technology (Guang Zhou)), Mengxuan Zhang (Iowa State University), Wen Hua (The University of Queensland), Xiaofang Zhou (The Hong Kong University of Science and Technology) Multi-Constraint Shortest Path (MCSP) generalizes the classic shortest path from single to multiple criteria such that more personalized needs can be satisfied. However, MCSP query is essentially a high-dimensional skyline problem and thus time-consuming to answer. Although the current Forest Hop Labeling (FHL) index can answer MCSP efficiently, it takes a long time to construct and lacks the flexibility to handle arbitrary criteria combinations. In this paper, we propose a skyline-cube-based FHL index that can handle the flexible MCSP efficiently. Firstly, we analyze the relation between low and high-dimensional skyline paths theoretically and use a cube to organize them hierarchically. After that, we propose methods to derive the high-dimensional path from the lower ones, which can adapt to the flexible scenario naturally and reduce the expensive high dimensional path concatenation. Then we introduce efficient methods for both single and multi-hop cube concatenations and propose pruning methods to further alleviate the computation. Finally, we improve the FHL structure with a lower height for faster construction and query. Experiments on real-life road networks demonstrate the superiority of our method over the state-of-the-art.
Dynamic Spanning Trees for Connectivity Queries on Fully-dynamic Undirected Graphs [Download Paper] Qing Chen (University of Zürich)*, Oded Lachish (Birkbeck, University of London), Sven Helmer (University of Zurich), Michael H Böhlen (University of Zurich) Answering connectivity queries is fundamental to fully-dynamic graphs where edges and vertices are inserted and deleted frequently. Existing works propose data structures and algorithms with worst case guarantees. We propose a new data structure, the dynamic tree (D-tree), together with algorithms to construct and maintain it. The D-tree is the first data structure that scales to fully-dynamic graphs with millions of vertices and edges and, on average, answers connectivity queries much faster than data structures with worst case guarantees.
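To see why full dynamism is the hard part, compare with the textbook union-find structure sketched below in Python: it answers connectivity under edge insertions only, whereas the D-tree targets graphs where edges are also deleted. The sketch is illustrative and unrelated to the D-tree's internals.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving keeps the forest shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Insert edge (a, b): merge the two components.
        self.parent[self.find(a)] = self.find(b)

    def connected(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.connected(0, 1), uf.connected(1, 3))  # True False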
Sancus: Staleness-Aware Communication-Avoiding Full-Graph Decentralized Training in Large-Scale Graph Neural Networks [Best Regular Research Paper] [Download Paper] Jingshu Peng (The Hong Kong University of Science and Technology)*, Zhao Chen (Hong Kong University of Science and Technology), Yingxia Shao (BUPT), Yanyan Shen (Shanghai Jiao Tong University), Lei Chen (Hong Kong University of Science and Technology), Jiannong Cao (The Hong Kong Polytechnic University) Graph neural networks (GNNs) have emerged due to their success at modeling graph data. Yet, it is challenging for GNNs to efficiently scale to large graphs. Thus, distributed GNNs come into play. To avoid communication caused by expensive data movement between workers, we propose Sancus, a staleness-aware communication-avoiding decentralized GNN system. By introducing a set of novel bounded embedding staleness metrics and adaptively skipping broadcasts, Sancus abstracts decentralized GNN processing as sequential matrix multiplication and uses historical embeddings via cache. Theoretically, we show bounded approximation errors of embeddings and gradients with convergence guarantee. Empirically, we evaluate Sancus with common GNN models via different system setups on large-scale benchmark datasets. Compared to SOTA works, Sancus can avoid up to 74% communication with at least 1.86x faster throughput on average without accuracy loss.
Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale [Download Paper] Yang Li (Peking University)*, Yu Shen (Peking University), Huaijun Jiang (Peking University), Wentao Zhang (Peking University), Jixiang Li (Kuaishou Inc.), Ji Liu (Kwai Inc.), Ce Zhang (ETH), Bin Cui (Peking University) The ever-growing demand and complexity of machine learning are putting pressure on hyper-parameter tuning systems: while the evaluation cost of models continues to increase, the scalability of state-of-the-art systems is becoming a crucial bottleneck. In this paper, inspired by our experience deploying hyper-parameter tuning in a real-world application in production and the limitations of existing systems, we propose Hyper-Tune, an efficient and robust distributed hyper-parameter tuning framework. Compared with existing systems, Hyper-Tune highlights multiple system optimizations, including (1) automatic resource allocation, (2) asynchronous scheduling, and (3) a multi-fidelity optimizer. We conduct extensive evaluations on both benchmark datasets and a large-scale real-world dataset in production. Empirically, we show that, with the aid of these optimizations, Hyper-Tune outperforms competitive hyper-parameter tuning systems on a wide range of scenarios, including XGBoost, CNN, RNN, and some architectural hyper-parameters for neural networks. Compared with the state-of-the-art BOHB and A-BOHB, we show that Hyper-Tune achieves up to 11.2x and 5.1x speedups, respectively.
Optimizing Machine Learning Inference Queries with Correlative Proxy Models [Download Paper] Zhihui Yang (Zhejiang Lab)*, Zuozhi Wang (UC Irvine), Yicong Huang (UC Irvine), Yao Lu (Microsoft Research), Chen Li (UC Irvine), X. Sean Wang (Fudan University) We consider accelerating machine learning (ML) inference queries on unstructured datasets. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable by classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates, or PP) assume independence among the query predicates, build a proxy model for each predicate offline, and rewrite a new query by injecting these cheap proxy models in front of the expensive ML UDFs. In this manner, unlikely inputs that do not satisfy the query predicates are filtered early to bypass the ML UDFs. We show that enforcing the independence assumption in this context may result in sub-optimal plans. In this paper, we propose CORE, a query optimizer that better exploits predicate correlations and accelerates ML inference queries. Our solution builds the proxy models online for a new query and leverages a branch-and-bound search process to reduce the building costs. Results on three real-world text, image, and video datasets show that CORE improves the query throughput by up to 63% compared to PP and up to 80% compared to running the queries as-is.
Enabling SQL-based Training Data Debugging for Federated Learning [Download Paper] Yejia Liu (Simon Fraser University), Weiyuan Wu (Simon Fraser University)*, Lampros Flokas (Columbia University), Jiannan Wang (Simon Fraser University), Eugene Wu (Columbia University) How can we debug a logistic regression model in a federated learning setting when seeing the model behave unexpectedly (e.g., the model rejects all high-income customers’ loan applications)? The SQL-based training data debugging framework has proved effective to fix this kind of issue in a non-federated learning setting. Given an unexpected query result over model predictions, this framework automatically removes the label errors from training data such that the unexpected behavior disappears in the retrained model. In this paper, we enable this powerful framework for federated learning. The key challenge is how to develop a security protocol for federated debugging which is proved to be secure, efficient, and accurate. Achieving this goal requires us to investigate how to seamlessly integrate the techniques from multiple fields (Databases, Machine Learning, and Cybersecurity). We first propose FedRain, which extends Rain, the state-of-the-art SQL-based training data debugging framework, to our federated learning setting. We address several technical challenges to make FedRain work and analyze its security guarantee and time complexity. The analysis results show that FedRain falls short in terms of both efficiency and security. To overcome these limitations, we redesign our security protocol and propose Frog, a novel SQL-based training data debugging framework tailored for federated learning. Our theoretical analysis shows that Frog is more secure, more accurate, and more efficient than FedRain. We conduct extensive experiments using several real-world datasets and a case study. The experimental results are consistent with our theoretical analysis and validate the effectiveness of Frog in practice.
Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction [Download Paper] Benjamin Hilprecht (TU Darmstadt)*, Carsten Binnig (TU Darmstadt) In this paper, we introduce zero-shot cost models, which enable learned cost estimation that generalizes to unseen databases. In contrast to state-of-the-art workload-driven approaches, which require executing a large set of training queries on every new database, zero-shot cost models thus allow a learned cost model to be instantiated out of the box without expensive training data collection. To enable such zero-shot cost models, we suggest a new learning paradigm based on pre-trained cost models. As core contributions to support the transfer of such a pre-trained cost model to unseen databases, we introduce a new model architecture and representation technique for encoding query workloads as input to those models. As we show in our evaluation, zero-shot cost estimation can provide more accurate cost estimates than state-of-the-art models for a wide range of (real-world) databases without requiring any query executions on unseen databases. Furthermore, we show that zero-shot cost models can be used in a few-shot mode that further improves their quality by retraining them with just a small number of additional training queries on the unseen database.
Anomaly Detection in Time Series: A Comprehensive Evaluation [Download Paper] [Experiment, Analysis & Benchmark Papers] Sebastian Schmidl (Hasso Plattner Institute, University of Potsdam), Phillip Wenig (Hasso Plattner Institute, University of Potsdam)*, Thorsten Papenbrock (Philipps University of Marburg) Detecting anomalous subsequences in time series data is an important task in areas ranging from manufacturing processes and finance applications to health care monitoring. An anomaly can indicate important events, such as production faults, delivery bottlenecks, system defects, or heart flicker, and is therefore of central interest. Because time series are often large and exhibit complex patterns, data scientists have developed various specialized algorithms for the automatic detection of such anomalous patterns. The number and variety of anomaly detection algorithms has grown significantly over the years and, because many of these solutions have been developed independently and by different research communities, there is no comprehensive study that systematically evaluates and compares the different approaches. For this reason, choosing the best detection technique for a given anomaly detection task is a difficult challenge. This comprehensive, scientific study carefully evaluates most state-of-the-art anomaly detection algorithms. We collected and re-implemented 71 anomaly detection algorithms from different domains and evaluated them on 976 time series datasets. The algorithms have been selected from different algorithm families and detection approaches to represent the entire spectrum of anomaly detection techniques. In the paper, we provide a concise overview of the techniques and their commonalities; we evaluate their individual strengths and weaknesses and, thereby, consider factors such as effectiveness, efficiency, and robustness. Our experimental results should ease the algorithm selection problem and open up new research directions.
HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approximate Nearest Neighbor Search [Download Paper] Kejing Lu (Nagoya University)*, Mineichi Kudo (Hokkaido University), Chuan Xiao (Osaka University and Nagoya University), Yoshiharu Ishikawa (Nagoya University) Approximate nearest neighbor search (ANNS) is a fundamental problem that has a wide range of applications in information retrieval and data mining. Among state-of-the-art in-memory ANNS methods, graph-based methods have attracted particular interest owing to their superior efficiency and query accuracy. Most of these methods focus on the selection of edges to shorten the searching path, but do not pay much attention to the computational cost at each hop. To reduce the cost, we propose a novel graph structure called HVS. HVS has a hierarchical structure of multiple layers that corresponds to a series of subspace divisions in a coarse-to-fine manner. In addition, we utilize a virtual Voronoi diagram in each layer to accelerate the search. By traversing Voronoi cells, HVS can reach the nearest neighbors of a given query efficiently, resulting in a reduction in the total searching cost. Experiments confirm that HVS is superior to other state-of-the-art graph-based methods.
The Inherent Time Complexity and An Efficient Algorithm for Subsequence Matching Problem [Download Paper] Zemin Chao (Harbin Institute of Technology)*, Hong Gao (Harbin Institute of Technology), Yinan An (Harbin Institute of Technology), Jianzhong Li (Harbin Institute of Technology) Subsequence matching is an important and fundamental problem on time series data. This paper studies the inherent time complexity of the subsequence matching problem and designs a more efficient algorithm for solving the problem. First, it is proved that the subsequence matching problem cannot be computed in time $O(n^{1-\delta})$, even allowing polynomial-time preprocessing, if the hypothesis SETH is true, where $n$ is the size of the input time series and $0 \leq \delta < 1$; i.e., the inherent complexity of the subsequence matching problem is $\omega(n^{1-\delta})$. Second, an efficient algorithm for the subsequence matching problem is proposed. In order to improve the efficiency of the algorithm, we design a new summarization method as well as a novel index for series data. The proposed algorithm supports both Euclidean Distance and DTW distance, with or without z-normalization. Experimental results show that the proposed algorithm is about 3 to 10 times faster than the state-of-the-art algorithm on the constrained z-normalized Euclidean Distance and DTW distance, and up to 7 to 12 times faster on Euclidean Distance.
Fast Dataset Search with Earth Mover's Distance [Download Paper] Wenzhe Yang (Wuhan University)*, Sheng Wang (Wuhan University), Yuan Sun (The University of Melbourne), Zhiyong Peng (Wuhan University, China) The amount of spatial data in open data portals has increased rapidly, raising the demand for spatial dataset search in large data repositories. In this paper, we tackle spatial dataset search by using the Earth Mover's Distance (EMD) to measure the similarity between datasets. EMD is a robust similarity measure between two distributions and has been successfully applied to multiple domains such as image retrieval, document retrieval, multimedia, etc. However, the existing EMD-based studies typically depend on a common filtering framework with a single pruning strategy, which still has a high search cost. To address this issue, we propose a Dual-Bound Filtering (DBF) framework to accelerate the EMD-based spatial dataset search. Specifically, we represent datasets by Z-order histograms and organize them as nodes in a tree structure. During a query, two levels of filtering are conducted based on pooling-based bounds and a TICT bound on EMD to prune dissimilar datasets efficiently. We conduct experiments on four real-world spatial data repositories and the experimental results demonstrate the efficiency and effectiveness of our DBF framework.
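For readers new to the metric, the snippet below computes a one-dimensional Earth Mover's Distance between two normalized histograms with SciPy; it only illustrates the underlying distance notion, whereas the paper operates on multi-dimensional Z-order histograms with its own pooling-based and TICT bounds.

from scipy.stats import wasserstein_distance

bins = [0.0, 1.0, 2.0, 3.0]          # shared bin centers
hist_a = [0.7, 0.2, 0.1, 0.0]        # normalized histogram of dataset A
hist_b = [0.1, 0.2, 0.3, 0.4]        # normalized histogram of dataset B

# Treat each histogram as a weighted point mass at its bin centers.
emd = wasserstein_distance(bins, bins, u_weights=hist_a, v_weights=hist_b)
print(f"EMD(A, B) = {emd:.3f}")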
Stingy Sketch: A Sketch Framework for Accurate and Fast Frequency Estimation [Download Paper] Haoyu Li (Peking University)*, Qizhi Chen (Peking University), Yixin Zhang (Peking University), Tong Yang (Peking University), Bin Cui (Peking University) Recording the frequency of items in highly skewed data streams has been a fundamental and hot problem in recent years. The literature demonstrates that sketches are the most promising solution. The typical metrics to measure a sketch are accuracy and speed, and existing sketches only make trade-offs between the two dimensions. In contrast, we aim to optimize both accuracy and speed at the same time. In this paper, we propose a new sketch framework called Stingy sketch with two key techniques: Bit-pinching Counter Tree (BCTree) and Prophet Queue (PQueue). The key idea of BCTree is to split a large fixed-size counter into many small nodes of a tree structure, and to use a precise encoding to perform carry-in operations with low processing overhead. The key idea of PQueue is to use a pipelined prefetching technique to make most memory accesses happen in the L2 cache without losing precision. Importantly, the two techniques are cooperative, so Stingy sketch can improve accuracy and speed simultaneously. Extensive experimental results show that Stingy sketch is up to 50% more accurate than the state-of-the-art (SOTA) accuracy-oriented sketches and up to 33% faster than the SOTA speed-oriented sketches.
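To ground the accuracy/speed discussion, here is a conventional count-min sketch in Python as a point of reference; it is not Stingy sketch, and the BCTree and PQueue designs described above are unrelated to this baseline.

import random

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=7):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item):
        # Count-min never underestimates: take the minimum over the depth rows.
        return min(self.table[r][self._index(r, item)] for r in range(self.depth))

cms = CountMinSketch()
for token in ["a"] * 100 + ["b"] * 5 + ["c"]:
    cms.add(token)
print(cms.estimate("a"), cms.estimate("b"))  # roughly 100 and 5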
Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection [Download Paper] John Paparrizos (University of Chicago)*, Paul Boniol (Université de Paris), Themis Palpanas (Université Paris Cité), Ruey Tsay (University of Chicago), Aaron J Elmore (University of Chicago), Michael Franklin (University of Chicago) Anomaly detection (AD) is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), AD for time series is also concerned with range-based anomalies (i.e., outliers spanning multiple observations). Nevertheless, it is common to use traditional information retrieval measures, such as Precision, Recall, and F-score, to assess the quality of methods by thresholding the anomaly score of each point to mark it as an anomaly or not. However, mapping discrete labels into continuous data introduces unavoidable shortcomings, complicating the evaluation of range-based contextual and collective anomalies. Notably, the choice of evaluation measure may significantly bias the experimental outcome. Despite over six decades of attention, there has never been a large-scale systematic quantitative and qualitative analysis of time-series AD evaluation measures to the best of our knowledge. This paper extensively evaluates quality measures for time-series AD to assess their robustness under noise, misalignments, and different anomaly cardinality ratios. Our results indicate that measures producing quality values independently of a threshold (i.e., AUC-ROC and AUC-PR) are more suitable for time-series AD. Motivated by this observation, we first extend the AUC-based measures to account for range-based anomalies. Then, we introduce a new family of parameter-free and threshold-independent measures, VUS (Volume Under the Surface), to evaluate methods while varying parameters. Our findings demonstrate that our four measures are significantly more robust and helpful in assessing and separating the quality of time-series AD methods. Interestingly, VUS measures are applicable across binary classification tasks for evaluating methods under different parameter choices.
Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting [Download Paper] Zezhi Shao (Institute of Computing Technology, Chinese Academy of Sciences)*, Zhao Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Wei Wei (Huazhong University of Science and Technology), Fei Wang (Institute of Computing Technology, Chinese Academy of Sciences), Yongjun Xu (Institute of Computing Technology, Chinese Academy of Sciences), Xin Cao (University of New South Wales), Christian S Jensen (Aalborg University) We all depend on mobility, and vehicular transportation affects the daily lives of most of us. Thus, the ability to forecast the state of traffic in a road network is an important functionality and a challenging task. Traffic data is often obtained from sensors deployed in a road network. Recent proposals on spatial-temporal graph neural networks have achieved great progress at modeling complex spatial-temporal correlations in traffic data by modeling traffic data as a diffusion process. However, intuitively, traffic data encompasses two different kinds of hidden time series signals, namely diffusion signals and inherent signals. Unfortunately, nearly all previous works coarsely consider traffic signals entirely as the outcome of the diffusion while neglecting the inherent signals, which impacts model performance negatively. To improve modeling performance, we propose a novel Decoupled Spatial-Temporal Framework (DSTF) that separates the diffusion and inherent traffic information in a data-driven manner, and that encompasses a unique estimation gate and a residual decomposition mechanism. The separated signals can subsequently be handled by the diffusion and inherent modules separately. Further, we propose an instantiation of DSTF, Decoupled Dynamic Spatial-Temporal Graph Neural Network (D2STGNN), that captures spatial-temporal correlations and also features a dynamic graph learning module that targets the learning of the dynamic characteristics of traffic networks. Extensive experiments with four real-world, large-scale traffic datasets demonstrate that the framework is capable of advancing the state of the art.
TSB-UAD: An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection [Download Paper] John Paparrizos (University of Chicago)*, Yuhao Kang (University of Chicago), Paul Boniol (Université de Paris), Ruey Tsay (University of Chicago), Themis Palpanas (Université Paris Cité), Michael Franklin (University of Chicago) The detection of anomalies in time series has gained ample academic and industrial attention. However, no comprehensive benchmark exists to evaluate time-series anomaly detection methods. It is common to use (i) proprietary or synthetic data, often biased to support particular claims; or (ii) a limited collection of publicly available datasets. Consequently, we often observe methods performing exceptionally well in one dataset but surprisingly poorly in another, creating an illusion of progress. To address the issues above, we thoroughly studied over one hundred papers to identify, collect, process, and systematically format datasets proposed in the past decades. We summarize our effort in TSB-UAD, a new benchmark to ease the evaluation of univariate time-series anomaly detection methods. Overall, TSB-UAD contains 13766 time series with labeled anomalies spanning different domains with high variability of anomaly types, ratios, and sizes. TSB-UAD includes 18 previously proposed datasets containing 1980 time series and we contribute two collections of datasets. Specifically, we generate 958 time series using a principled methodology for transforming 126 time-series classification datasets into time series with labeled anomalies. In addition, we present data transformations with which we introduce new anomalies, resulting in 10828 time series with varying complexity for anomaly detection. Finally, we evaluate 12 representative methods demonstrating that TSB-UAD is a robust resource for assessing anomaly detection methods. We make our data and code available at www.timeseries.org/TSB-UAD. TSB-UAD provides a valuable, reproducible, and frequently updated resource to establish a leaderboard of univariate time-series anomaly detection methods.
OnlineSTL: Scaling Time Series Decomposition by 100x [Download Paper] [Scalable Data Science] Abhinav Mishra (Splunk)*, Ram Sriharsha (Splunk), Sichen Zhong (Splunk) Decomposing a complex time series into trend, seasonality, and remainder components is an important primitive that facilitates time series anomaly detection, change point detection, and forecasting. Although numerous batch algorithms are known for time series decomposition, none operate well in an online scalable setting where high throughput and real-time response are paramount. In this paper, we propose OnlineSTL, a novel online algorithm for time series decomposition which is highly scalable and is deployed for real-time metrics monitoring on high-resolution, high-ingest rate data. Experiments on different synthetic and real world time series datasets demonstrate that OnlineSTL achieves orders of magnitude speedups (100x) for large seasonalities while maintaining quality of decomposition.
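As a frame of reference for what is being decomposed, the short Python sketch below performs a crude batch trend/seasonality/remainder split using a centered moving average and per-phase means; OnlineSTL's online algorithm is different and far more scalable, so this is purely illustrative.

def decompose(series, period):
    n = len(series)
    half = period // 2
    # Trend: centered moving average (edges fall back to the nearest window).
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [x - t for x, t in zip(series, trend)]
    # Seasonality: average the detrended values at each phase of the period.
    seasonal_means = [
        sum(detrended[p::period]) / len(detrended[p::period]) for p in range(period)
    ]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    remainder = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, remainder

series = [i * 0.1 + (1 if i % 4 == 0 else 0) for i in range(24)]
trend, seasonal, remainder = decompose(series, period=4)
print([round(s, 2) for s in seasonal[:4]])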
Quantifying identifiability to choose and audit epsilon in differentially private deep learning [Download Paper] Daniel Bernau (SAP)*, Günther Eibl (FH Salzburg), Philip-William Grassal (Heidelberg University), Hannah Keller (SAP SE), Florian Kerschbaum (University of Waterloo) Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters (epsilon, delta). Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the theoretical upper bound on privacy loss (epsilon, delta) might be loose, depending on the chosen sensitivity and data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which (epsilon, delta) are only indirectly related. We transform (epsilon, delta) to a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical (epsilon, delta).
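For intuition about the kind of translation the paper performs, the pure epsilon-DP special case is sketched below in LaTeX: because any output's likelihood ratio under "record present" versus "record absent" is bounded by e^epsilon, a Bayesian adversary's posterior odds grow by at most that factor. The paper's actual bound additionally handles delta, multidimensional queries, and composition, so this is only the simplest instance.

% Pure \varepsilon-DP special case (intuition only, not the paper's bound):
\[
  \frac{\rho'}{1-\rho'} \;\le\; e^{\varepsilon}\,\frac{\rho}{1-\rho}
  \qquad\Longrightarrow\qquad
  \rho' \;\le\; \frac{e^{\varepsilon}\rho}{1 + (e^{\varepsilon}-1)\rho},
\]
where $\rho$ is the adversary's prior belief that a given record is in the training dataset and $\rho'$ is the posterior belief after observing the mechanism's output.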
Frequency Estimation Under Multiparty Differential Privacy: One-shot and Streaming [Download Paper] Ziyue Huang (HKUST)*, Yuan Qiu (Hong Kong Univ. of Science and Technology ), Ke Yi (Hong Kong University of Science and Technology), Graham Cormode (University of Warwick) We study the fundamental problem of frequency estimation under both privacy and communication constraints, where the data is distributed among $k$ parties. We consider two application scenarios: (1) one-shot, where the data is static and the aggregator conducts a one-time computation; and (2) streaming, where each party receives a stream of items over time and the aggregator continuously monitors the frequencies. We adopt the model of multiparty differential privacy (MDP), which is more general than local differential privacy (LDP) and (centralized) differential privacy. Our protocols achieve optimality (up to logarithmic factors) permissible by the more stringent of the two constraints. In particular, when specialized to the $\varepsilon$-LDP model, our protocol achieves an error of $\sqrt{k}/(e^{\Theta(\varepsilon)}-1)$ using $O(k\max\{ \varepsilon, \log \frac{1}{\varepsilon} \})$ bits of communication and $O(k \log u)$ bits of public randomness, where $u$ is the size of the domain.
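For readers unfamiliar with LDP frequency oracles, here is a minimal sketch of classical k-ary randomized response with unbiased count estimation; it only illustrates the epsilon-LDP special case mentioned above and is not the paper's multiparty protocol or its communication-optimal encoding.

# k-ary randomized response: each party perturbs its item locally; the
# aggregator debiases the observed counts. Illustrative only.
import math, random
from collections import Counter

def perturb(item, domain, eps):
    p = math.exp(eps) / (math.exp(eps) + len(domain) - 1)
    if random.random() < p:
        return item
    return random.choice([d for d in domain if d != item])

def estimate(reports, domain, eps):
    n, k = len(reports), len(domain)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = 1.0 / (math.exp(eps) + k - 1)
    counts = Counter(reports)
    # E[count_v] = n*q + true_v*(p - q), so invert to get an unbiased estimate
    return {v: (counts[v] - n * q) / (p - q) for v in domain}

domain = list(range(10))
data = [random.choice(domain) for _ in range(100000)]
reports = [perturb(x, domain, eps=1.0) for x in data]
print(estimate(reports, domain, eps=1.0))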
Federated Matrix Factorization with Privacy Guarantee [Download Paper] Zitao Li (Purdue University)*, Bolin Ding ("Data Analytics and Intelligence Lab, Alibaba Group"), Ce Zhang (ETH), Ninghui Li (Purdue University), Jingren Zhou (Alibaba Group) Matrix factorization (MF) approximates unobserved ratings in a rating matrix, whose rows correspond to users and columns correspond to items to be rated, and has been serving as a fundamental building block in recommendation systems. This paper comprehensively studies the problem of matrix factorization in different federated learning (FL) settings, where a set of parties want to cooperate in training but refuse to share data directly. We first propose a generic algorithmic framework for various settings of federated matrix factorization (FMF) and provide a theoretical convergence guarantee. We also systematically characterize privacy-leakage risks in data collection, training, and publishing stages for three different settings and introduce privacy notions to provide end-to-end privacy protections. The first one is vertical federated learning (VFL), where multiple parties have the ratings from the same set of users but on disjoint sets of items. The second one is horizontal federated learning (HFL), where parties have ratings from different sets of users but on the same set of items. The third setting is local federated learning (LFL), where the ratings of the users are only stored on their local devices. We introduce adapted versions of FMF with the privacy notions guaranteed in the three settings. In particular, a new private learning technique called embedding clipping is introduced and used in all three settings to ensure differential privacy. For the LFL setting, we combine differential privacy with secure aggregation to protect the communication between user devices and the server with a strength similar to the local differential privacy model, but with much better accuracy. We perform experiments to demonstrate the effectiveness of our approaches.
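The sketch below illustrates, under our own simplifying assumptions, what norm-clipping an embedding update and adding Gaussian noise could look like; the function name and parameters are hypothetical, and the paper's embedding clipping technique and its privacy calibration are more involved.

# Hypothetical sketch: bound the L2 norm of an embedding update (to cap
# sensitivity) and add Gaussian noise before it leaves a party.
import numpy as np

def clip_and_noise(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))      # norm clipping
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

user_embedding_grad = np.random.randn(32)
private_grad = clip_and_noise(user_embedding_grad, clip_norm=0.5, noise_multiplier=1.1)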
Frequency-Hiding Order-Preserving Encryption with Small Client Storage [Download Paper] Dongjie Li (Tianjin Key Laboratory of Network and Data Security Technology (Nankai University)), Siyi Lv (Nankai University), Yanyu Huang (Nankai University), Yijing Liu (Nankai University), Tong Li (Nankai University)*, Zheli Liu (Nankai University), Liang Guo (Huawei Technologies Co., Ltd.) The range query on encrypted databases is usually implemented using the order-preserving encryption (OPE) technique, which preserves the order of plaintexts. Since the frequency leakage of plaintexts makes OPE vulnerable to frequency-analyzing attacks, some frequency-hiding order-preserving encryption (FH-OPE) schemes have been proposed. However, existing FH-OPE schemes require either large client storage of size O(n) or O(log n) rounds of interaction for each query, where $n$ is the total number of plaintexts. To this end, we propose an FH-OPE scheme that achieves small client storage without additional client-server interactions. In detail, our scheme achieves O(N) client storage and one interaction per query, where N is the number of distinct plaintexts and N <= n. In particular, our scheme performs remarkably well when N << n. Moreover, we design a new coding tree for producing the order-preserving encodings that indicate the order of each ciphertext in the database. The coding strategy of our coding tree ensures that encodings are updated infrequently when new ciphertexts are inserted. Experimental results show that the single-round interaction and infrequent encoding updates make our scheme more efficient than previous FH-OPE schemes.
Scalable Byzantine Fault Tolerance via Partial Decentralization [Download Paper] Balaji Arun (Virginia Tech)*, Binoy Ravindran (Virginia Tech) Byzantine consensus is a critical component in many permissioned blockchains and distributed ledgers. We propose a new paradigm for designing BFT protocols called DQBFT that addresses three major performance and scalability challenges that plague past protocols: (i) high communication costs to reach geo-distributed agreement, (ii) uneven resource utilization hampering performance, and (iii) performance degradation under varying node and network conditions and high-contention workloads. Specifically, DQBFT divides consensus into two parts: 1) durable command replication without a global order, and 2) consistent global ordering of commands across all replicas. DQBFT achieves this by decentralizing the heavy task of replicating commands while centralizing the ordering process. Under the new paradigm, we develop a new protocol, Destiny, that uses a combination of three techniques to achieve high performance and scalability: using a trusted subsystem to decrease consensus's quorum size, using threshold signatures to attain linear communication costs, and reducing client communication. Our evaluations on a 300-replica geo-distributed deployment reveal that DQBFT protocols achieve significant performance gains over prior art: ~3x better throughput and ~50% better latency.
NeuChain: A Fast Permissioned Blockchain System with Deterministic Ordering [Download Paper] Zeshun Peng (Northeastern University, China), Yanfeng Zhang (Northeastern University)*, Qian Xu (Northeastern University), Haixu Liu (Northeastern University), Yuxiao Gao (Northeastern University), Xiaohua Li (Northeastern University), Ge Yu (Northeastern University) Blockchain serves as a replicated transactional processing system in a trustless distributed environment. Existing blockchain systems all rely on an explicit ordering step to determine the global order of transactions that are collected from multiple peers. The ordering consensus can be the bottleneck since it must be Byzantine-fault tolerant and can scarcely benefit from parallel execution. In this paper, we propose an ordering-free architecture that makes ordering implicit through deterministic execution. Based on this novel architecture, we develop a permissioned blockchain system, NeuChain. A number of key optimizations such as asynchronous block generation and pipelining are leveraged for high throughput and low latency. Several security mechanisms are also designed to make our system robust to malicious attacks. Our geo-distributed experimental results show that NeuChain can achieve 47.2-64.1x throughput improvement over Hyperledger Fabric and 1.6-12.2x throughput improvement over state-of-the-art high-performance blockchains.
Beaconnect: Continuous Web Performance A/B Testing at Scale [Download Paper] Wolfram Wingerath (University of Oldenburg)*, Benjamin Wollmer (University of Hamburg), Markus Bestehorn (Amazon Web Services), Stephan Succo (Baqend), Sophie Ferrlein (Baqend), Florian Bücklers (Baqend), Jörn Domnik (Baqend), Fabian Panse (Universität Hamburg), Erik Witt (Baqend), Anil Sener (Amazon Web Services), Felix Gessert (Universität Hamburg), Norbert Ritter (Universität Hamburg) Content delivery networks (CDNs) are critical for minimizing access latency in the Web as they efficiently distribute online resources across the globe. But since CDNs can only be enabled on the scope of entire websites (and not for individual users or user groups), the effects of page speed acceleration are often quantified with potentially skewed before-after comparisons rather than statistically sound A/B tests. In this paper, we introduce the system Beaconnect for tracking and analyzing Web performance without being subject to these limitations. Our contributions are threefold. First, Beaconnect is natively compatible with A/B testing Web performance as it is built for a custom browser-based acceleration approach and thus does not rely on traditional CDN technology. Second, we present our continuous aggregation pipeline that achieves sub-minute end-to-end latency. Third, we describe and evaluate a scheme for continuous real-time reporting that is especially efficient for large customers and processes data from over 100 million monthly users at Baqend.
APEX: A High-Performance Learned Index on Persistent Memory [Download Paper] Baotong Lu (Chinese University of Hong Kong)*, Jialin Ding (MIT), Eric Lo (Chinese University of Hong Kong), Umar Farooq Minhas (Apple), Tianzheng Wang (Simon Fraser University) The recently released persistent memory (PM) offers high performance and persistence, and is cheaper than DRAM. This opens up new possibilities for indexes that operate and persist data directly on the memory bus. Recent learned indexes exploit data distribution and have shown great potential for some workloads. However, none support persistence or instant recovery, and existing PM-based indexes typically evolve B+-trees without considering learned indexes. This paper proposes APEX, a new PM-optimized learned index that offers high performance, persistence, concurrency, and instant recovery. APEX is based on ALEX, a state-of-the-art updatable learned index, to combine and adapt the best of past PM optimizations and learned indexes, allowing it to reduce PM accesses while still exploiting machine learning. Our evaluation on Intel DCPMM shows that APEX can perform up to ~15x better than existing PM indexes and can recover from failures in ~42ms.
Index Checkpoints for Instant Recovery in In-Memory Database Systems [Download Paper] Leon Lee (HuaweiCloud)*, Siphrey Xie (Huawei Technologies Co. Ltd.), Yunus Ma (Huawei Technologies Co. Ltd.), Shimin Chen (Chinese Academy of Sciences) We observe that the time bottleneck during the recovery phase of an IMDB (In-Memory DataBase system) shifts from log replaying to index rebuilding after state-of-the-art techniques for instant recovery have been applied. In this paper, we investigate index checkpoints to eliminate this bottleneck. However, improper designs may lead to inconsistent index checkpoints or incur severe performance degradation. For the correctness challenge, we combine two techniques, i.e., deferred deletion of index entries and on-demand clean-up of dangling index entries after recovery, to achieve data correctness. For the efficiency challenge, we propose three wait-free index checkpoint algorithms, i.e., ChainIndex, MirrorIndex, and IACoW, for supporting efficient normal processing and fast recovery. We implement our proposed solutions in HiEngine, an IMDB being developed as part of Huawei's next-generation cloud-native database product. We evaluate the impact of index checkpoint persistence on recovery and transaction performance using two workloads (i.e., TPC-C and Microbench). We analyze the pros and cons of each algorithm. Our experimental results show that HiEngine can be recovered instantly with only slight (i.e., 5% - 11%) performance degradation. Therefore, we strongly recommend integrating index checkpointing into IMDBs if recovery time is a crucial product metric.
YeSQL: “You extend SQL” with Rich and Highly Performant User-Defined Functions in Relational Databases [Download Paper] Yannis E Foufoulas (University of Athens)*, Alkis Simitsis (Athena Research Center), Eleftherios Stamatogiannakis (University of Athens), Yannis Ioannidis (University of Athens) The diversity and complexity of modern data management applications have led to the extension of the relational paradigm with syntactic and semantic support for User-Defined Functions (UDFs). Although well-established in traditional DBMS settings, UDFs have become central in many application contexts as well, such as data science, data analytics, and edge computing. Still, a critical limitation of UDFs is the impedance mismatch between their evaluation and relational processing. In this paper, we present YeSQL, an SQL extension with rich UDF support along with a pluggable architecture to easily integrate it with either server-based or embedded database engines. YeSQL currently supports Python UDFs fully integrated with relational queries as scalar, aggregator, or table functions. Key novel characteristics of YeSQL include easy implementation of complex algorithms and several performance enhancements, including tracing JIT compilation of Python UDFs, parallelism and fusion of UDFs, stateful UDFs, and seamless integration with a database engine. Our experimental analysis showcases the usability and expressiveness of YeSQL and demonstrates that our techniques of minimizing context switching between the relational engine and the Python VM are very effective and achieve significant speedups up to 68x in common, practical use cases compared to earlier approaches and alternative implementation choices.
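As a rough analogy to the UDF-in-SQL integration YeSQL targets, the snippet below registers a Python scalar UDF in Python's built-in sqlite3 module; YeSQL's contributions (tracing JIT compilation, UDF fusion and parallelism, stateful and table UDFs, deep engine integration) go well beyond this baseline.

# Register a Python scalar UDF in an embedded engine and call it from SQL.
import sqlite3, math

def log_price(p):
    return math.log(p) if p and p > 0 else None

conn = sqlite3.connect(":memory:")
conn.create_function("log_price", 1, log_price)       # name, arity, callable
conn.execute("CREATE TABLE items(name TEXT, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [("a", 10.0), ("b", 100.0)])
for row in conn.execute("SELECT name, log_price(price) FROM items"):
    print(row)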
AB-tree: Index for Concurrent Random Sampling and Updates [Download Paper] Zhuoyue Zhao (University at Buffalo - SUNY)*, Dong Xie (Penn State University), Feifei Li (Alibaba Group) There has been an increasing demand for real-time data analytics. Approximate Query Processing (AQP) is a popular option for that because it can use random sampling to trade some accuracy for lower query latency. However, state-of-the-art AQP systems either rely on scan-based sampling algorithms to draw samples, which can still incur a non-trivial table-scan cost, or create samples of the database in a preprocessing step, which are hard to update. The alternative is to use aggregate B-tree indexes to support both random sampling and updates in a database in logarithmic time. However, to the best of our knowledge, it is unknown how to design an aggregate B-tree to support highly concurrent random sampling and updates, due to the difficulty of maintaining the aggregate weights correctly and efficiently with concurrency. In this work, we identify the key challenges to achieving high concurrency and present AB-tree, an index for highly concurrent random sampling and update operations. We also conduct extensive experiments to show its efficiency and efficacy in a variety of workloads.
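The sequential toy below shows why per-subtree aggregate weights enable logarithmic-time uniform sampling from an ordered index; it deliberately omits the concurrency control that is the paper's actual contribution.

# Toy aggregate tree: each node caches the size (weight) of its subtree, so a
# uniform random key is drawn by descending on prefix weights in O(height).
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.weight = key, None, None, 1

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.weight += 1                         # maintain the subtree aggregate
    return node

def sample(node):
    r = random.randrange(node.weight)
    lw = node.left.weight if node.left else 0
    if r < lw:
        return sample(node.left)
    if r == lw:
        return node.key
    return sample(node.right)

root = None
for k in [5, 1, 9, 3, 7]:
    root = insert(root, k)
print(sample(root))                          # each key is equally likely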
Efficient Load-Balanced Butterfly Counting on GPU [Download Paper] Qingyu Xu (Renmin University of China), Feng Zhang (Renmin University of China)*, Zhiming Yao (Renmin University of China), Lv Lu (Renmin University of China), Xiaoyong Du (Renmin University of China), Dong Deng (Rutgers University - New Brunswick), Bingsheng He (National University of Singapore) Butterfly counting is an important and costly operation in bipartite graphs. GPUs are popular parallel heterogeneous devices and can bring significant performance improvements for data science applications. Unfortunately, no existing work enables efficient butterfly counting on GPUs. To fill this gap, we propose a GPU-based butterfly counting approach, called G-BFC. G-BFC addresses three main challenges. First, butterfly counting involves massive serial operations, which leads to severe synchronization overheads and performance degradation. We unlock the serial region and utilize the shared memory on the GPU to handle it. Second, butterfly counting on GPU faces the workload imbalance problem. We develop a novel adaptive strategy to balance the workload among threads, improving efficiency. Third, butterfly counting in parallel suffers from the traversal of a huge number of two-hop paths, also called wedges, in bipartite graphs. We develop a novel preprocessing strategy, which can effectively reduce the number of wedges to be traversed and the memory cost. Experiments show that G-BFC brings significant performance benefits. On ten real datasets, G-BFC can achieve a 19.8x performance speedup over the state-of-the-art solution.
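For context, a simple sequential wedge-based butterfly counter looks as follows: a butterfly is a (2,2)-biclique, so two left vertices sharing c right neighbors contribute C(c,2) butterflies. G-BFC parallelizes and load-balances exactly this wedge traversal on the GPU; this CPU sketch is only a baseline.

# Exact butterfly count on a small bipartite graph via wedge counting.
from collections import defaultdict
from itertools import combinations

def count_butterflies(adj):                    # adj: left vertex -> set of right vertices
    right_to_left = defaultdict(list)
    for u, nbrs in adj.items():
        for v in nbrs:
            right_to_left[v].append(u)
    wedges = defaultdict(int)                  # (u, w) -> number of shared right neighbors
    for lefts in right_to_left.values():
        for u, w in combinations(sorted(lefts), 2):
            wedges[(u, w)] += 1
    return sum(c * (c - 1) // 2 for c in wedges.values())

adj = {0: {10, 11}, 1: {10, 11, 12}, 2: {12}}
print(count_butterflies(adj))                  # vertices 0 and 1 share {10, 11} -> 1 butterfly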
In-Network Leaderless Replication for Distributed Data Stores [Download Paper] Gyuyeong Kim (Sungshin Women's University), Wonjun Lee (Korea University)* Leaderless replication allows any replica to handle any type of request to achieve read scalability and high availability for distributed data stores. However, this entails the burdensome coordination overhead of replication protocols, degrading write throughput. In addition, the data store still requires coordination for membership changes, making it hard to resolve server failures quickly. To this end, we present NetLR, a replicated data store architecture that supports high performance, fault tolerance, and linearizability simultaneously. The key idea of NetLR is moving the entire replication functionality into the network by leveraging the switch as an on-path in-network replication orchestrator. Specifically, NetLR performs consistency-aware read scheduling, high-performance write coordination, and active fault adaptation in the network switch. Our in-network replication eliminates inter-replica coordination for writes and membership changes, providing high write performance and fast failure handling. NetLR can be implemented using programmable switches at line rate with only 5.68% additional memory usage. We implement a prototype of NetLR on an Intel Tofino switch and conduct extensive testbed experiments. Our evaluation results show that NetLR is the only solution that achieves high throughput and low latency and is robust to server failures.
Data Management in Microservices: State of the Practice, Challenges, and Research Directions [Download Paper] Rodrigo N Laigner (University of Copenhagen)*, Yongluan Zhou (University of Copenhagen), Marcos Antonio Vaz Salles (University of Copenhagen (DIKU)), Yijian Liu (University of Copenhagen), Marcos Kalinowski (PUC-Rio) Microservices have become a popular architectural style for data-driven applications, given their ability to functionally decompose an application into small and autonomous services to achieve scalability, strong isolation, and specialization of database systems to the workloads and data formats of each service. Despite the accelerating industrial adoption of this architectural style, an investigation of the state of the practice and challenges practitioners face regarding data management in microservices is lacking. To bridge this gap, we conducted a systematic literature review of representative articles reporting the adoption of microservices, we analyzed a set of popular open-source microservice applications, and we conducted an online survey to cross-validate the findings of the previous steps with the perceptions and experiences of over 120 experienced practitioners and researchers. Through this process, we were able to categorize the state of practice of data management in microservices and observe several foundational challenges that cannot be solved by software engineering practices alone, but rather require system-level support to alleviate the burden imposed on practitioners. We discuss the shortcomings of state-of-the-art database systems regarding microservices and we conclude by devising a set of features for microservice-oriented database systems.
DSON: JSON CRDT Using Delta-Mutations For Document Stores [Download Paper] Arik Rinberg (Technion)*, Tomer Solomon (IBM), Roee Shlomo (IBM), Guy Khazma (IBM), Gal Lushi (IBM Research), Idit Keidar (Technion), Paula Ta-shma (IBM) We propose DSON, a space-efficient $\delta$-based CRDT approach for distributed JSON document stores, enabling high availability at a global scale, while providing strong eventual consistency guarantees. We define the semantics of our CRDT-based approach formally, and prove its correctness and convergence. Previous approaches optimize for collaborative document editing and store metadata proportional to the number of updates to a document, which is not acceptable for long-lived document management. The metadata stored with our approach is $O(D + n \log n)$, where $n$ is the number of replicas and $D$ is the number of document elements, and is not dependent on the number of document updates. We also implement our approach and demonstrate its space efficiency empirically. This provides the basis for robust, highly available distributed document stores with well-defined semantics and safety guarantees, relieving application developers from the burden of conflict resolution.
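To illustrate the delta-mutation idea in isolation, here is a minimal delta-state grow-only counter: replicas ship small deltas instead of full state or an ever-growing operation log, and merging is a commutative join. DSON applies the same principle to full JSON documents with far richer semantics, which this toy does not attempt to capture.

# Delta-state G-Counter: increment returns only the changed entry (the delta),
# and merge takes the entry-wise maximum, so any delivery order converges.
class GCounter:
    def __init__(self, replica_id):
        self.id, self.counts = replica_id, {}

    def increment(self, n=1):
        self.counts[self.id] = self.counts.get(self.id, 0) + n
        return {self.id: self.counts[self.id]}        # the delta to ship

    def merge(self, delta):
        for r, c in delta.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
d1, d2 = a.increment(), b.increment(2)
a.merge(d2); b.merge(d1)
assert a.value() == b.value() == 3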
Moneyball: Proactive Auto-Scaling in Microsoft Azure SQL Database Serverless [Download Paper] Olga Poppe (Microsoft)*, Qun Guo (Microsoft), Willis Lang (Microsoft), Pankaj Arora (Microsoft), Morgan Oslake (Microsoft), Shize Xu (Microsoft), Ajay Kalhan (Microsoft) Microsoft Azure SQL Database is among the leading relational database service providers in the cloud. Serverless compute automatically scales resources based on workload demand. When a database becomes idle, its resources are reclaimed. When activity returns, resources are resumed. Customers pay only for the resources they use. However, scaling is currently reactive rather than proactive with respect to customers' workloads. Therefore, resources may not be immediately available when a customer comes back online after a prolonged idle period. In this work, we focus on reducing this delay in resource availability by predicting the pause/resume patterns and proactively resuming resources for each database. Furthermore, we avoid taking away resources for short idle periods to relieve the back-end from ineffective pause/resume workflows. Results of this study are currently being used worldwide to find the middle ground between quality of service and cost of operation.
Chukonu: A Fully-Featured High-Performance Big Data Framework that Integrates a Native Compute Engine into Spark [Download Paper] Bowen Yu (Tsinghua University)*, Guanyu Feng (Tsinghua University), Huanqi Cao (Tsinghua University), Xiaohan Li (Tsinghua University), Zhenbo Sun (Tsinghua University), Haojie Wang (Tsinghua University), Xiaowei Zhu (Tsinghua University), Weimin Zheng (Tsinghua University), Wenguang Chen (Tsinghua University) Apache Spark is a widely deployed big data analytics framework that offers such attractive features as resiliency, load-balancing, and a rich ecosystem. However, there is still plenty of room for improvement in its performance. Although a data-parallel system in a native programming language significantly improves performance, it may require re-implementing many functionalities of Spark to become a full-featured system. It is desirable for a native big data system to implement only its compute engine in a native language to ensure high efficiency, and to reuse the mature features provided by Spark rather than re-implement everything. But the interaction between the JVM and the native world risks becoming a bottleneck. This paper proposes Chukonu, a native big data framework that re-uses critical big data features provided by Spark. Owing to our novel DAG-splitting approach, the potential Spark integration overhead is alleviated, and it even outperforms existing pure-native big data frameworks. Chukonu splits DAG programs into run-time parts and compile-time parts: the run-time parts are delegated to Spark to offload the complexities due to feature implementations, while the compile-time parts are natively compiled. We propose a series of optimization techniques to be applied to the compile-time parts, such as operator fusion, vectorization, and compaction, to significantly reduce the Spark integration overhead. Evaluation results show that Chukonu achieves a speedup of up to 71.58x (geometric mean 6.09x) over Apache Spark, and up to 7.20x (geometric mean 2.30x) over pure-native frameworks on six commonly used big data applications. By translating the physical plan produced by SparkSQL into Chukonu programs, Chukonu accelerates SparkSQL's TPC-DS performance by 2.29x.
SwitchTx: Scalable In-Network Coordination for Distributed Transaction Processing [Download Paper] Junru Li (Tsinghua University)*, Youyou Lu (luyouyou@tsinghua.edu.cn), Yiming Zhang (Xiamen University), Qing Wang (Tsinghua University), Zhuo Cheng (Huawei Storage Product Line), Keji Huang (Huawei), Jiwu Shu (shujw@tsinghua.edu.cn) Online-transaction-processing (OLTP) applications require the underlying storage system to guarantee consistency and serializability for distributed transactions involving large numbers of servers, which tends to introduce high coordination cost and cause low system performance. In-network coordination is a promising approach to alleviate this problem, which leverages programmable switches to move a piece of coordination functionality into the network. This paper presents a fast and scalable transaction processing system called SwitchTx. At the core of SwitchTx is a decentralized multi-switch in-network coordination mechanism, which leverages modern switches' programmability to reduce coordination cost while avoiding the central-switch-caused problems in the state-of-the-art Eris transaction processing system. SwitchTx abstracts various coordination tasks (e.g., locking, validating, and replicating) as in-switch gather-and-scatter (GaS) operations, and offloads coordination to a tree of switches for each transaction (instead of to a central switch for all transactions) where the client and the participants connect to the leaves. Moreover, to control the transaction traffic intelligently, SwitchTx reorders the coordination messages according to their semantics and redesigns the congestion control combined with admission control. Evaluation shows that SwitchTx outperforms current transaction processing systems in various workloads by up to 2.16x in throughput, 40.4% in latency, and 41.5% in lock time.
07Sep
Theoretically and Practically Efficient Parallel Nucleus Decomposition [Download Paper] Jessica Shi (MIT)*, Laxman Dhulipala (MIT CSAIL), Julian Shun (MIT) This paper studies the nucleus decomposition problem, which has been shown to be useful in finding dense substructures in graphs. We present a novel parallel algorithm that is efficient both in theory and in practice. Our algorithm achieves a work complexity matching the best sequential algorithm while also having low depth (parallel running time), which significantly improves upon the only existing parallel nucleus decomposition algorithm (Sariyuce et al., PVLDB 2018). The key to the theoretical efficiency of our algorithm is a new lemma that bounds the amount of work done when peeling cliques from the graph, combined with the use of theoretically-efficient parallel algorithms for clique listing and bucketing. We introduce several new practical optimizations, including a new multi-level hash table structure to store information on cliques space-efficiently and a technique for traversing this structure cache-efficiently. On a 30-core machine with two-way hyper-threading on real-world graphs, we achieve up to a 55x speedup over the state-of-the-art parallel nucleus decomposition algorithm by Sariyuce et al., and up to a 40x self-relative parallel speedup. We are able to efficiently compute larger nucleus decompositions than prior work on several million-scale graphs for the first time.
TaGSim: Type-aware Graph Similarity Learning and Computation [Download Paper] Jiyang Bai (Florida State University), Peixiang Zhao (Florida State University)* Computing similarity between graphs is a fundamental and critical problem in graph-based applications, and one of the most commonly used graph similarity measures is graph edit distance (GED), defined as the minimum number of graph edit operations that transform one graph to another. Existing GED solutions suffer from severe performance issues due in particular to the NP-hardness of exact GED computation. Recently, deep learning has shown early promise for GED approximation with high accuracy and low computational cost. However, existing methods treat GED as a global, coarse-grained graph similarity value, while neglecting the type-specific transformative impacts incurred by different types of graph edit operations, including node insertion/deletion, node relabeling, edge insertion/deletion, and edge relabeling. In this paper, we propose a type-aware graph similarity learning and computation framework, TaGSim (Type-aware Graph Similarity), that estimates GED in a fine-grained approach w.r.t. different graph edit types. Specifically, for each type of graph edit operations, TaGSim models its unique transformative impacts upon graphs, and encodes them into high-quality, type-aware graph embeddings, which are further fed into type-aware neural networks for accurate GED estimation. Extensive experiments on five real-world datasets demonstrate the effectiveness and efficiency of TaGSim, which significantly outperforms state-of-the-art GED solutions.
ABC: Attributed Bipartite Co-clustering [Download Paper] Junghoon Kim (Nanyang Technological University)*, Kaiyu Feng (NTU), Gao Cong (Nanyang Technological University), Diwen Zhu (Alibaba), Wenyuan Yu (Alibaba Group), Chunyan Miao (NTU) Finding a set of co-clusters in a bipartite network is a fundamental and important problem. In this paper, we present the Attributed Bipartite Co-clustering (ABC) problem, which unifies two main concepts: (i) bipartite modularity optimization, and (ii) attribute cohesiveness. To the best of our knowledge, this is the first work to find co-clusters while considering attribute cohesiveness. We prove that ABC is NP-hard and is not in APX, unless P=NP. We propose three algorithms: (1) a top-down algorithm; (2) a bottom-up algorithm; (3) a group matching algorithm. Extensive experimental results on real-world attributed bipartite networks demonstrate the efficiency and effectiveness of our algorithms.
View Selection over Knowledge Graphs in Triple Stores [Download Paper] Theofilos Mailis (Kapodistrian University of Athens)*, Yannis Kotidis (Athens University of Economics and Business), Stamatis Christoforidis (University of Athens), Evgeny Kharlamov (University of Oslo), Yannis Ioannidis (University of Athens) Knowledge Graphs (KGs) are collections of interconnected and annotated entities that have become powerful assets for data integration, search enhancement, and other industrial applications. Knowledge Graphs such as DBpedia may contain billions of triples and are intensively queried with millions of queries per day. A prominent approach to enhance query answering on Knowledge Graph databases is View Materialization, i.e., the materialization of an appropriate set of computations that will improve query performance. We study the problem of view materialization and propose a view selection methodology for processing query workloads with more than a million queries. Our approach heavily relies on subgraph pattern mining techniques that allow us to create efficient summarizations of massive query workloads while also identifying the candidate views for materialization. At the core of our work is the correspondence between the view selection problem and that of "Maximizing a Nondecreasing Submodular Set Function Subject to a Knapsack Constraint". The latter leads to a tractable view-selection process for native triple stores that allows an approximation of the optimal selection of views. Our experimental evaluation shows that all the steps of the view-selection process are completed in a few minutes, while the corresponding rewritings accelerate 67.68% of the queries in the DBpedia query workload. Those queries are executed in 2.19% of their initial time on average.
Magic Shapes for SHACL Validation [Download Paper] Shqiponja Ahmetaj (TU Wien)*, Bianca Löhnert (TU Wien), Magdalena Ortiz (TU Wien, Austria), Mantas Simkus (TU Vienna) A key prerequisite for the successful adoption of the Shapes Constraint Language (SHACL), the W3C-standardized constraint language for RDF graphs, is the availability of automated tools that efficiently validate targeted constraints (known as shapes graphs) over possibly very large RDF graphs. There are already significant efforts to produce optimized engines for SHACL validation, but they focus on restricted fragments of SHACL. For unrestricted SHACL, that is, SHACL with unrestricted recursion and negation, there is no validator beyond a proof-of-concept prototype, and existing techniques are inherently incompatible with the goal-driven approaches being pursued by existing validators. Instead they require a global computation on the entire data graph that is not only computationally very costly, but also brittle, and can easily result in validation failures due to conflicts that are irrelevant to the validation targets. To overcome these challenges, we present a "magic" transformation, based on Magic Sets as known from Logic Programming, that transforms a SHACL shapes graph S into a new shapes graph S′ whose validation considers only the relevant neighbourhood of the targeted nodes. The new S′ is equivalent to S whenever there are no conflicts between the constraints and the data, and in case the validation of S fails due to conflicts that are irrelevant to the target, S′ may still admit a lazy, target-oriented validation. We implement the algorithm and run preliminary experiments, showing that our transformation can be a stepping stone towards validators for full SHACL, and that it can significantly improve the performance of the only prototype validator that currently supports full recursion and negation.
Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms [Download Paper] Dimitrios Koutsoukos (ETHZ)*, Ingo Müller (Google), Renato Marroquín (Oracle Labs), Ana Klimovic (ETH Zurich), Gustavo Alonso (ETHZ) The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasingly required due to the speed at which algorithms and hardware evolve. To address this limitation, we present Modularis, an execution layer for data analytics based on sub-operators, i.e., composable building blocks resembling traditional database operators but at a finer granularity. To demonstrate the feasibility and advantages of our approach, we use Modularis to build a distributed query processing system supporting relational queries running on an RDMA cluster, a serverless cloud platform, and a smart storage engine. Modularis requires minimal code changes to execute queries across these three diverse hardware platforms, showing that the sub-operator approach reduces the amount and complexity of the code to maintain. In fact, changes in the platform affect only those sub-operators that depend on the underlying hardware (in our use cases, mainly the sub-operators related to network communication). We show the end-to-end performance of Modularis by comparing it with a framework for SQL processing (Presto), a commercial cluster database (SingleStore), as well as Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these systems, proving that the design and architectural advantages of a modular design can be achieved without degrading performance. We also compare Modularis with a hand-optimized implementation of a join for RDMA clusters. We show that Modularis has the advantage of being easily extensible to a wider range of join variants and group-by queries, none of which are supported in the hand-tuned join.
PRUC: P-Regions with User-Defined Constraint [Download Paper] Yongyi Liu (University of California, Riverside)*, Ahmed Mahmood (Purdue University), Amr Magdy (University of California Riverside), Sergio Rey (University of California, Riverside) This paper introduces a generalized spatial regionalization problem, namely PRUC (P-Regions with User-defined Constraint), that partitions spatial areas into homogeneous regions. PRUC accounts for user-defined constraints imposed over aggregate region properties. We show that PRUC is an NP-hard problem. To solve PRUC, we introduce GSLO (Global Search with Local Optimization), a parallel stochastic regionalization algorithm. GSLO is composed of two phases: (1) Global Search that initially partitions areas into regions that satisfy a user-defined constraint, and (2) Local Optimization that further improves the quality of the partitioning with respect to intra-region similarity. We conduct an extensive experimental study using real datasets to evaluate the performance of GSLO. Experimental results show that GSLO is up to 100x faster than the state-of-the-art algorithms. GSLO provides partitioning that is up to 6x better with respect to intra-region similarity. Furthermore, GSLO is able to handle 4x larger datasets than the state-of-the-art algorithms.
Continuous Social Distance Monitoring in Indoor Space [Download Paper] Harry Kai-ho Chan (Roskilde University)*, Huan Li (Aalborg University), Xiao Li (Roskilde University), Hua Lu (Roskilde University) The COVID-19 pandemic has caused over 6 million deaths since 2020. To contain the spread of the virus, social distancing is one of the most simple yet effective approaches. Motivated by this, in this paper we study the problem of continuous social distance monitoring (SDM) in indoor space, in which we can monitor and predict the pairwise distances between moving objects (people) in a building in real time. SDM can also serve as the fundamental service for downstream applications, e.g., a mobile alert application that prevents its users from potential close contact with others. To facilitate the monitoring process, we propose a framework that takes the current and future uncertain locations of the objects into account, and finds the object pairs that are close to each other in a near future. We develop efficient algorithms to update the result when object locations update. We carry out experiments on both real and synthetic datasets. The results verify the efficiency and effectiveness of our proposed framework and algorithms.
Operon: An Encrypted Database for Ownership-Preserving Data Management [Download Paper] Sheng Wang (Alibaba Group)*, Yiran Li (Alibaba Group), Huorong Li (Alibaba Group), Feifei Li (Alibaba Group), Chengjin Tian (Alibaba Group), Le Su (Alibaba Group), Yanshan Zhang (Alibaba Group), Yubing Ma (Alibaba Group), Lie Yan (Alibaba Group), Yuanyuan Sun (Alibaba Group), Xuntao Cheng (Alibaba Group), Xiaolong Xie (Alibaba Group), Yu Zou (Alibaba Group) The past decade has witnessed the rapid development of cloud computing and data-centric applications. While these innovations offer numerous attractive features for data processing, they also bring in new issues about the loss of data ownership. Though some encrypted databases have emerged recently, they cannot fully address these concerns for the data owner. In this paper, we propose an ownership-preserving database (OPDB), a new paradigm that characterizes different roles' responsibilities in today's applications and preserves data ownership throughout the entire application. We build Operon to follow the OPDB paradigm, which utilizes the trusted execution environment (TEE) and introduces a behavior control list (BCL). Different from access controls that merely handle accessibility permissions, BCL further puts data operation behaviors under control. Besides, we make Operon practical for real-world applications by extending database capabilities towards flexibility, functionality, and ease of use. Operon is the first database framework with which the data owner exclusively controls its data across different roles' subsystems. We have successfully integrated Operon with different TEEs, i.e., Intel SGX and an FPGA-based implementation, and various database services on Alibaba Cloud, i.e., PolarDB and RDS PostgreSQL. The evaluation shows that Operon achieves 71% - 97% of the performance of plaintext databases under the TPC-C benchmark while preserving data ownership.
Manu: A Cloud Native Vector Database Management System [Download Paper] Rentong Guo (Zilliz), Xiaofan Luan (Zilliz), Long Xiang (Southern University of Science and Technology), Xiao Yan (Southern University of Science and Technology), Xiaomeng Yi (Zilliz), Jigao Luo (Zilliz), Qianya Cheng (Zilliz), Weizhi Xu (Zilliz), Jiarui Luo (Southern University of Science and Technology), Frank Liu (Zilliz), Zhenshan Cao (Zilliz), Yanliang Qiao (Zilliz), Ting Wang (Zilliz), Bo Tang (Southern University of Science and Technology)*, Charles Xie (Zilliz) With the development of learning-based embedding models, embedding vectors are widely used for analyzing and searching unstructured data. As vector collections exceed billion-scale, fully managed and horizontally scalable vector databases are necessary. In the past three years, through interaction with our 1200+ industry users, we have sketched a vision for the features that next-generation vector databases should have, which include long-term evolvability, tunable consistency, good elasticity, and high performance. We present Manu, a cloud-native vector database that implements these features. It is difficult to integrate all these features if we follow traditional DBMS design rules. As most vector data applications do not require complex data models and strong data consistency, our design philosophy is to relax the data model and consistency constraints in exchange for the aforementioned features. Specifically, Manu firstly exposes the write-ahead log (WAL) and binlog as backbone services. Secondly, write components are designed as log publishers while all read-only analytic and search components are designed as independent subscribers to the log services. Finally, we utilize multi-version concurrency control (MVCC) and a delta consistency model to simplify the communication and cooperation among the system components. These designs achieve a low coupling among the system components, which is essential for elasticity and evolution. We also extensively optimize Manu for performance and usability with hardware-aware implementations and support for complex search semantics. Manu has been used for many applications, including, but not limited to, recommendation, multimedia, language, medicine and security. We evaluated Manu in three typical application scenarios to demonstrate its efficiency, elasticity, and scalability.
Robust and Budget-Constrained Encoding Configurations for In-Memory Database Systems [Download Paper] Martin Boissier (Hasso Plattner Institute)* Data encoding has been applied to database systems for decades as it mitigates bandwidth bottlenecks and reduces storage requirements. But even in the presence of these advantages, most in-memory database systems use data encoding only conservatively as the negative impact on runtime performance can be severe. Real-world systems in which large parts of the data are infrequently accessed, together with cost-efficiency constraints in cloud environments, require solutions that automatically and efficiently select encoding techniques, including heavy-weight compression. In this paper, we introduce workload-driven approaches to automatically determine memory budget-constrained encoding configurations using greedy heuristics and linear programming. We show for TPC-H, TPC-DS, and the Join Order Benchmark that optimized encoding configurations can reduce the main memory footprint significantly without a loss in runtime performance over state-of-the-art dictionary encoding. To yield robust selections, we extend the linear programming-based approach to incorporate query runtime constraints and mitigate unexpected performance regressions.
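A hypothetical greedy sketch of budget-constrained encoding selection is shown below; the column names, cost numbers, and selection rule are invented for illustration, and the paper's heuristics and linear programs are more elaborate.

# Greedy sketch: start from the fastest configuration and, while over the
# memory budget, switch the column whose cheaper encoding saves the most
# bytes per unit of added runtime cost.
def select_encodings(columns, budget_bytes):
    # columns: {name: [(encoding, size_bytes, runtime_cost), ...]} sorted by runtime_cost
    choice = {c: opts[0] for c, opts in columns.items()}          # fastest first
    while sum(o[1] for o in choice.values()) > budget_bytes:
        best, best_ratio = None, 0.0
        for c, opts in columns.items():
            cur = choice[c]
            for alt in opts:
                saved, penalty = cur[1] - alt[1], alt[2] - cur[2]
                if saved > 0 and saved / max(penalty, 1e-9) > best_ratio:
                    best, best_ratio = (c, alt), saved / max(penalty, 1e-9)
        if best is None:
            raise ValueError("budget infeasible")
        choice[best[0]] = best[1]
    return choice

cols = {"l_comment": [("dict", 900, 1.0), ("lz4", 300, 1.4)],
        "l_quantity": [("dict", 200, 1.0), ("frame-of-ref", 120, 1.1)]}
print(select_encodings(cols, budget_bytes=600))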
LANNS: A Web-Scale Approximate Nearest Neighbor Lookup System [Download Paper] [Scalable Data Science] Ishita Doshi (LinkedIn, Bengaluru)*, Dhritiman Das (LinkedIn, Bengaluru), Ashish Bhutani (Uber), Rajeev Kumar (LinkedIn, Bengaluru), Rushi Bhatt (Compass), Niranjan Balasubramanian (LinkedIn) Nearest neighbor search (NNS) has a wide range of applications in information retrieval, computer vision, machine learning, databases, and other areas. The existing state-of-the-art algorithm for nearest neighbor search, Hierarchical Navigable Small World Networks (HNSW), is unable to scale to large datasets of 100M records in high dimensions. In this paper, we propose LANNS, an end-to-end platform for Approximate Nearest Neighbor Search, which scales to web-scale datasets. Library for Large Scale Approximate Nearest Neighbor Search (LANNS) is deployed in multiple production systems for identifying top-K (100 <= k <= 200) approximate nearest neighbors with a latency of a few milliseconds per query and a high throughput of ~2.5k Queries Per Second (QPS) on a single node, on large (e.g., ~180M data points), high-dimensional (50-2048 dimensional) datasets.
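A single-node HNSW baseline, here using the hnswlib package (an assumption on our part; LANNS's own HNSW implementation may differ), looks as follows; LANNS adds two-level partitioning and a serving layer on top of such per-shard indexes to reach web scale.

# Build and query a single HNSW index with hnswlib.
import numpy as np
import hnswlib

dim, n = 128, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(100)                                   # query-time accuracy/latency knob

labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)                                 # (5, 10)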
Tair-PMem: A Fully Durable Non-Volatile Memory Database [Download Paper] Caixin Gong (Alibaba Group)*, Chengjin Tian (Alibaba Group), Zhengheng Wang (Alibaba Group), Sheng Wang (Alibaba Group), Xiyu Wang (Alibaba Group), Qiulei Fu (Alibaba Group), Wu Qin (Alibaba), Long Qian (Alibaba Group), Rui Chen (Alibaba), Jiang Qi (Alibaba), Ruo Wang (Alibaba), Guoyun Zhu (Alibaba Group), Chenghu Yang (Alibaba Group), Wei Zhang (Alibaba Inc.), Feifei Li (Alibaba Group) In-memory databases (IMDBs) have been the backbone of modern systems that demand high throughput and low latency. Because of the cost and volatility of DRAM, IMDBs become incompetent when dealing with workloads that require large data volume and strict durability. The emergence of non-volatile memory (NVM) brings new opportunities for IMDBs to tackle this situation. However, it is non-trivial to build an NVM-based IMDB, due to performance degradation, NVM programming complexity, and other challenges. In this paper, we present Tair-PMem, an NVM-based enterprise-strength database atop Redis, the most popular IMDB. Tair-PMem adopts a well-controlled data layout and a log-as-user-data design to mitigate NVM overheads. It eases the NVM programming complexity by providing a hybrid memory programming toolkit. To better leverage the enterprise-strength features and implementations from Redis, Tair-PMem retrofits it in a less intrusive way to achieve full compatibility and stability, while retaining its advanced features. With all of the above techniques elaborately implemented, Tair-PMem satisfies full durability, high throughput, and low latency at the same time. Tair-PMem has now been publicly available as a cloud service on Alibaba Cloud. To the best of our knowledge, Tair-PMem is the first cloud service that makes good use of the persistence capability of NVM.
CloudJump: Optimizing Cloud Databases for Cloud Storages [Download Paper] Zongzhi Chen (Alibaba Group)*, Xinjun Yang (Alibaba Group), Feifei Li (Alibaba Group), Xuntao Cheng (Alibaba Group), Qingda Hu (Alibaba Group), Zheyu Miao (Alibaba Group), Rongbiao Xie (Alibaba Group), Xiaofei Wu (Alibaba Group), Kang Wang (Alibaba Group), Zhao Song (Alibaba Group), Haiqing Sun (Alibaba Group), Zechao Zhuang (Alibaba Group), Yuming Yang (Alibaba Group), Jie Xu (Alibaba Group), Liang Yin (Alibaba Group), Wenchao Zhou (Alibaba Group), Sheng Wang (Alibaba Group) There has been an increasing interest in building cloud-native databases that decouple computation and storage for elasticity. A cloud-native database often adopts cloud storage underneath its storage engine, leveraging another layer of virtualization and providing a high-performance and elastic storage service without exposing complex storage details. It helps reduce the maintenance cost and expedite development cycles for the database kernels. We have observed significant differences between local and cloud storage that invalidate many designs inside existing databases when they are ported to cloud storage. In this paper, we analyze the challenges and opportunities of both B-tree and LSM-tree-based storage engines when they are deployed on cloud storage. We propose an optimization framework that guides database developers to transform on-premise databases into their cloud-native counterparts. We use a B+-tree-based InnoDB as a demonstration vehicle where we have implemented a suite of optimizations using the proposed framework and extend such efforts to the LSM-tree-based RocksDB. On both engines, our evaluations show significant performance improvements on cloud storage.
Accurate Summary-based Cardinality Estimation Through the Lens of Cardinality Estimation Graphs [Download Paper] [Best Experiment, Analysis and Benchmark Paper] Jeremy Chen (University of Waterloo)*, Yuqing Huang (University of Waterloo), Mushi Wang (university of waterloo), Semih Salihoglu (University of Waterloo), Kenneth Salem (University of Waterloo) This paper is an experimental and analytical study of two classes of summary-based cardinality estimators that use statistics about input relations and small-size joins in the context of graph database management systems: (i) optimistic estimators that make uniformity and conditional independence assumptions; and (ii) the recent pessimistic estimators that use information theoretic linear programs (LPs). We begin by analyzing how optimistic estimators use pre-computed statistics to generate cardinality estimates. We show these estimators can be modeled as picking bottom-to-top paths in a cardinality estimation graph (CEG), which contains sub-queries as nodes and edges whose weights are average degree statistics. We show that existing optimistic estimators have either undefined or fixed choices for picking CEG paths as their estimates and ignore alternative choices. Instead, we outline a space of optimistic estimators to make an estimate on CEGs, which subsumes existing estimators. We show, using an extensive empirical analysis, that effective paths depend on the structure of the queries. While on acyclic queries and queries with small-size cycles, using the maximum-weight path is effective to address the well known underestimation problem, on queries with larger cycles these estimates tend to overestimate, which can be addressed by using minimum weight paths. We next show that optimistic estimators and seemingly disparate LP-based pessimistic estimators are in fact connected. Specifically, we show that CEGs can also model some recent pessimistic estimators. This connection allows us to adopt an optimization from pessimistic estimators to optimistic ones, and provide insights into the pessimistic estimators, such as showing that they have combinatorial solutions.
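The following toy makes the CEG idea concrete under our own simplified assumptions: nodes are sub-queries, edge weights are average-degree statistics, and an estimate is the product of weights along a bottom-to-top path, with the choice between maximum- and minimum-weight paths being exactly the knob the paper analyzes. The example graph and its numbers are hypothetical.

# Enumerate the estimates of all bottom-to-top paths in a tiny hand-built CEG.
def path_estimates(ceg, node, target, acc):
    # ceg: {sub-query: [(larger sub-query, avg_degree), ...]} forming a DAG
    if node == target:
        return [acc]
    out = []
    for nxt, w in ceg.get(node, []):
        out += path_estimates(ceg, nxt, target, acc * w)
    return out

ceg = {"R": [("RS", 3.0), ("RT", 0.5)],
       "RS": [("RST", 1.5)],
       "RT": [("RST", 8.0)]}
ests = path_estimates(ceg, "R", "RST", acc=1000)   # |R| = 1000 base cardinality
print(max(ests), min(ests))                        # max-weight vs min-weight path choice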
FINEdex: A Fine-grained Learned Index Scheme for Scalable and Concurrent Memory Systems [Download Paper] Pengfei Li (Huazhong University of Science and Technology), Yu Hua (Huazhong University of Science and Technology)*, Jingnan Jia (Huazhong University of Science and Technology), Pengfei Zuo (Huazhong University of Science and Technology) Index structures in memory systems are important for improving overall system performance. The promising learned indexes leverage deep-learning models to complement existing index structures and obtain significant performance improvements. Existing schemes rely on a delta-buffer to support scalability, which however incurs high overheads when a large amount of data is inserted, due to the need to check both the learned indexes and the extra delta-buffer. The practical system performance also decreases since the shared delta-buffer quickly becomes large and requires frequent retraining due to high data dependency. To address the problems of limited scalability and frequent retraining, we propose a FINE-grained learned index scheme with high scalability, called FINEdex, which constructs independent models with a flattened data structure (i.e., data arrays with low data dependency) under the trained data array to concurrently process requests with low overheads. By further efficiently exploring and exploiting the characteristics of the workloads, FINEdex processes new requests in place with the support of non-blocking retraining, hence adapting to new distributions without blocking the system. We evaluate FINEdex via YCSB and real-world datasets, and extensive experimental results demonstrate that FINEdex improves performance by up to 1.8x and 2.5x over the state-of-the-art XIndex and Masstree, respectively. We have released the open-source code of FINEdex for public use on GitHub.
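The read-only toy below shows the basic learned-index lookup that FINEdex builds on: a linear model predicts a key's position in a sorted array and a bounded local search corrects the error. FINEdex's fine-grained structure, concurrency, and non-blocking retraining are not modeled here.

# Single-model learned index over a sorted array with error-bounded correction.
import bisect
import numpy as np

keys = np.sort(np.random.randint(0, 1_000_000, size=10_000))
pos = np.arange(len(keys))
slope, intercept = np.polyfit(keys, pos, 1)              # the "model"
err = int(np.max(np.abs(slope * keys + intercept - pos))) + 1

def lookup(k):
    guess = int(slope * k + intercept)
    lo, hi = max(0, guess - err), min(len(keys), guess + err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), k)  # bounded local search
    return i if i < len(keys) and keys[i] == k else None

probe = int(keys[1234])
assert keys[lookup(probe)] == probe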
VIP Hashing - Adapting to Skew in Popularity of Data on the Fly [Download Paper] Aarati Kakaraparthy (University of Wisconsin, Madison)*, Jignesh Patel (UW - Madison), Brian Kroth (Microsoft), Kwanghyun Park (Microsoft Gray Systems Lab) All data is not equally popular. Often, some portion of data is more frequently accessed than the rest, which causes a skew in popularity of the data items. Adapting to this skew can improve performance, and this topic has been studied extensively in the past for disk-based settings. In this work, we consider an in-memory data structure, namely the hash table, and show how one can leverage the skew in popularity for higher performance. Hashing is a low-latency operation, sensitive to the effects of caching and code complexity, among other factors. These factors make learning in-the-loop challenging, as the overhead of performing additional operations can have a significant impact on performance. In this paper, we propose VIP hashing, a hash table method that uses lightweight mechanisms for learning the skew in popularity and adapting the hash table layout on the fly. These mechanisms are non-blocking, i.e., the hash table is operational at all times. The overhead is controlled by sensing changes in the popularity distribution to dynamically switch the mechanisms on and off as needed. We ran extensive tests against a host of workloads generated by Wiscer, a homegrown benchmarking tool, and we find that VIP hashing improves performance in the presence of skew (22% increase in fetch operation throughput for a hash table with 1M keys under low skew) while adapting to insert and delete operations, and to changing popularity distributions of keys on the fly. Our experiments on DuckDB show that VIP hashing reduces the end-to-end execution time of TPC-H query 9 by 20% under low skew.
Leveraging Query Logs and Machine Learning for Parametric Query Optimization [Download Paper] Kapil Vaidya (MIT)*, Anshuman Dutt (Microsoft Research), Vivek Narasayya (Microsoft), Surajit Chaudhuri (Microsoft) Parametric query optimization (PQO) must address two problems: identify a relatively small number of plans to cache for a parameterized query (populateCache), and efficiently select the best cached plan to use for executing any instance of the parameterized query (getPlan). Our approach decouples these two decisions. We formulate populateCache as an optimization problem with the goal of identifying a set of plans that minimizes the optimizer estimated cost of queries in the log, and present an efficient algorithm. For getPlan, we leverage query logs to train machine learning (ML) models to choose the lowest optimizer-estimated cost plan from the cached plans. We conduct extensive experiments using complex parameterized queries from benchmarks and real workloads. Our algorithm for populateCache achieves low geometric mean sub-optimality (1.2) even for complex queries using relatively few plans, and scales well to large query logs. The mean latency of our ML model based getPlan technique (210 microsec) is between one to four orders of magnitude faster compared to prior PQO techniques. The mean sub-optimality is low (1.05), and the 95th percentile sub-optimality (1.3) is between 1.1x and 25x lower compared to prior techniques. Finally, we present an efficient algorithm for getPlan that leverages execution time information in query logs to circumvent inaccuracies of the query optimizer's cost estimates.
Interactive Mining with Ordered and Unordered Attributes [Download Paper] Weicheng Wang (Hong Kong University of Science and Technology)*, Raymond Chi-wing Wong (Hong Kong University of Science and Technology) There are various queries proposed to assist users in finding their favorite tuples from a dataset with the help of user interaction. Specifically, they interact with a user by asking questions. Each question presents two tuples, which are selected from the dataset based on the user's answers to the previous questions, and asks the user to select the one s/he prefers. Following the user feedback, the user preference is learned implicitly, and the best tuple w.r.t. the learned preference is returned. However, existing queries only consider datasets with ordered attributes (e.g., price), where there exists a trivial order on the attribute values. In practice, a dataset can also be described by unordered attributes, where there is no consensus about the order of the attribute values. For example, the size of a laptop is an unordered attribute. One user might favor a large size because s/he could enjoy a large screen, while another user may prefer a small size for portability. In this paper, we study how to find a user's favorite tuple from the dataset that has both ordered and unordered attributes by interacting with the user. We study our problem progressively. First, we look into a special case in which the dataset is described by one ordered and one unordered attributes. We present algorithm DI that is asymptotically optimal in terms of the number of questions asked. Then, we dig into the general case in which the dataset has several ordered and unordered attributes. We propose two algorithms BS and EDI that have provable performance guarantees and perform well empirically. Experiments were conducted on synthetic and real datasets, showing that our algorithms outperform existing algorithms in the number of questions asked and the execution time. Under typical settings, our algorithms ask up to 10 times fewer questions and take several orders of magnitude less time than existing algorithms.
On Repairing Timestamps for Regular Interval Time Series [Download Paper] Chenguang Fang (Tsinghua University), Shaoxu Song (Tsinghua University)*, Yinan Mei (Tsinghua University) Time series data often come with regular time intervals, e.g., sensor data collected at a pre-specified frequency in IoT scenarios, air quality data regularly recorded by outdoor monitors, and GPS signals periodically received from multiple satellites. However, due to various issues such as transmission latency, device failure, repeated requests, and so on, timestamps could be dirty and lead to irregular time intervals. Amending the irregular time intervals has obvious benefits, not only improving data quality but also leading to more accurate applications such as frequency-domain analysis and more effective compression in storage. The timestamp repairing problem, however, is challenging, given many interacting factors to determine, including the time interval, the start timestamp, the series length, as well as the matching between the time series before and after repairing. Our contributions in this paper are (1) formalizing the timestamp repairing problem for regular interval time series to minimize the cost w.r.t. move, insert and delete operations; (2) devising an exact approach with advanced pruning strategies based on lower bounds of repairing; (3) proposing an approximate algorithm based on bi-directional dynamic programming. The experimental results demonstrate the superiority of our proposal in both timestamp repair accuracy and the aforesaid applications. Remarkably, the repair results can be used to evaluate time series data quality measures. Both the repair and measure functions have been implemented in an open-source time series database, Apache IoTDB.
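The snippet below only conveys the cost model sketched in the abstract: snap dirty timestamps onto a candidate regular grid and charge move, insert, and delete operations. It is a greedy toy with assumed cost weights, not the paper's exact pruning-based algorithm or its bi-directional dynamic-programming approximation.

```python
# Greedy illustration of repairing timestamps to a regular interval; the cost
# weights, grid parameters, and nearest-slot heuristic are assumptions.
def repair_cost(timestamps, start, interval, length,
                move_cost=1.0, insert_cost=2.0, delete_cost=2.0):
    grid = [start + i * interval for i in range(length)]
    used = set()
    cost = 0.0
    for t in sorted(timestamps):
        # snap to the nearest still-free grid slot
        slot = min((g for g in grid if g not in used),
                   key=lambda g: abs(g - t), default=None)
        if slot is None:
            cost += delete_cost                      # no slot left: delete point
        else:
            used.add(slot)
            cost += move_cost * (abs(slot - t) / interval)
    cost += insert_cost * (length - len(used))       # empty slots need insertions
    return cost

dirty = [0, 11, 19, 33, 58]                          # irregular observations
print(repair_cost(dirty, start=0, interval=10, length=6))
```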
CHEX: Multiversion Replay with Ordered Checkpoints [Download Paper] Naga Nithin Manne (Argonne National Lab), Shilvi Satpati (DePaul University), Tanu Malik (DePaul University)*, Amitabha Bagchi (IIT Delhi), Ashish Gehani (SRI), Amitabh Chaudhary (University of Chicago) In scientific computing and data science disciplines, it is often necessary to share application workflows and repeat results. Current tools containerize application workflows, and share the resulting container for repeating results. These tools, due to containerization, do improve sharing of results. However, they do not improve the efficiency of replay. In this paper, we present the multiversion replay problem, which arises when multiple versions of an application are containerized, and each version must be replayed to repeat results. To avoid executing each version separately, we develop CHEX, which checkpoints program state and determines when it is permissible to reuse program state across versions. It does so using system call-based execution lineage. Our capability to identify common computations across versions enables us to consider optimizing replay using an in-memory cache, based on a checkpoint-restore-switch system. We show that the multiversion replay problem is NP-hard, and propose efficient heuristics for it. CHEX reduces overall replay time by sharing common computations but avoids storing a large number of checkpoints. We demonstrate that CHEX maintains lightweight package sharing, and improves the total time of multiversion replay by 50% on average.
SpaceSaving±: An Optimal Algorithm for Frequency Estimation and Frequent Items in the Bounded Deletion Model [Download Paper] Fuheng Zhao (UCSB)*, Divy Agrawal (University of California, Santa Barbara), Amr El Abbadi (UC Santa Barbara), Ahmed Metwally (Uber) In this paper, we propose the first deterministic algorithms to solve the frequency estimation and frequent item problems in the bounded-deletion model. We establish the space lower bound for solving the deterministic frequent items problem in the bounded deletion model, and propose the Lazy SpaceSaving± and SpaceSaving± algorithms with optimal space bound. We develop an efficient implementation of the SpaceSaving± algorithm that minimizes the latency of update operations using novel data structures. The experimental evaluations testify that SpaceSaving± has accurate frequency estimations and achieves very high recall and precision across different data distributions while using minimal space. Our experiments clearly demonstrate that, if allowed the same space, SpaceSaving± is more accurate than the state-of-the-art protocols with up to 93% of the items deleted. Moreover, motivated by prior work, we propose Dyadic SpaceSaving±, the first deterministic quantile approximation sketch in the bounded-deletion model.
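For readers unfamiliar with the counter-based family this work builds on, the sketch below is classic SpaceSaving with k counters plus a naive decrement on deletions. It is not the SpaceSaving± algorithm and carries none of its bounded-deletion space or accuracy guarantees.

```python
# Classic SpaceSaving summary with a naive (illustrative) deletion handler.
class SpaceSaving:
    def __init__(self, k):
        self.k = k
        self.counters = {}                    # item -> estimated count

    def insert(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # evict the minimum counter and inherit its count (an overestimate)
            victim = min(self.counters, key=self.counters.get)
            self.counters[item] = self.counters.pop(victim) + 1

    def delete(self, item):
        # naive placeholder: only adjust items currently being monitored
        if item in self.counters:
            self.counters[item] = max(0, self.counters[item] - 1)

    def estimate(self, item):
        return self.counters.get(item, 0)

ss = SpaceSaving(k=3)
for x in "aababcbadccaab":
    ss.insert(x)
ss.delete("c")
print(ss.counters, ss.estimate("a"))
```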
Trie memtables in Cassandra [Download Paper] [Industry] Branimir Lambov (DataStax)* This paper discusses a new memtable implementation for Apache Cassandra which is based on tries (also called prefix trees) and byte-comparable representations of database keys. The implementation is already in production use in DataStax Enterprise 6.8 and is currently in the process of being integrated into mainstream Apache Cassandra as CEP-19. It improves on the legacy solution in the performance of modification and lookup as well as the size of the structure for a given amount of data. Crucially for Cassandra (a database running under the Java Virtual Machine), it also reduces garbage collection and general memory management complexity by operating on blocks of fixed size in large pre-allocated buffers. We detail the architecture of the solution and demonstrate some of the performance improvements that we have been able to achieve with it.
Meta's Next-generation Realtime Monitoring and Analytics Platform [Download Paper] [Industry] Stavros Harizopoulos (Meta), Taylor Hopper (Meta), Morton Mo (Meta)*, Shyam Sundar Chandrasekaran (Meta), Tongguang Chen (Meta), Yan Cui (Meta), Nandini Ganesh (Meta), Gary Helmling (Meta), Hieu Pham (Meta), Sebastian Wong (Meta) Unlike traditional database systems where data and system availability are tied together, a wide class of systems targeting realtime monitoring and analytics over structured logs exists where these properties can be decoupled. In these systems, availability and freshness of data is often more important than perfectly complete answers. One such system is Scuba. Historically, Scuba has favored system availability along with speed and freshness of results over data completeness and durability. While these choices allowed Scuba to grow from terabyte scale to petabyte scale and continue onboarding a variety of use cases, they also came at an operational cost of dealing with incomplete data and managing data loss. In this paper, we present the next generation of Scuba's architecture, codenamed Kraken, which decouples the storage management from the query serving system and introduces a single, durable source of truth. We also describe the journey of how we deployed Kraken into full production as we gradually turned off the older system with no user-visible down time.
DeepTEA: Effective and Efficient Online Time-dependent Trajectory Outlier Detection [Download Paper] Xiaolin Han (The University of Hong Kong)*, Reynold Cheng ("The University of Hong Kong, China"), Chenhao Ma (The University of Hong Kong), Tobias Grubenman (University of Bonn) In this paper, we study anomalous trajectory detection, which aims to extract abnormal movements of vehicles on the roads. This important problem, which facilitates understanding of traffic behavior and detection of taxi fraud, is challenging due to the varying traffic conditions at different times and locations. To tackle this problem, we propose the deep-probabilistic-based time-dependent anomaly detection algorithm (DeepTEA). This method, which employs deep-learning methods to obtain time-dependent outliers from a huge volume of trajectories, can handle complex traffic conditions and detect outliers accurately. We further develop a fast and approximate version of DeepTEA, in order to capture abnormal behaviors in real time. Compared with state-of-the-art solutions, our method is 17.52% more accurate than seven competitors on average, and can handle millions of trajectories.
Example-based Spatial Pattern Matching [Download Paper] Yue Chen (Nanyang Technological University)*, Kaiyu Feng (NTU), Gao Cong (Nanyang Technological University), Han Mao Kiah (Nanyang Technological University) The prevalence of GPS-enabled mobile devices and location-based services yields massive volumes of spatial objects, where each object contains information including geographical location, name, address, category and other attributes. This paper introduces a novel type of query termed example-based spatial pattern matching (EPM) query. It takes as input a set of spatial objects, each of which is associated with one or more keywords and a location. These objects serve as an example that depicts the spatial pattern that users want to retrieve. The EPM query returns all sets of objects that match the spatial pattern. The EPM query can be used for applications like urban planning, scene recognition and similar region search. We propose an efficient algorithm and three pruning techniques to answer EPM queries. Furthermore, we provide an approximation guarantee for intermediate results of the algorithm. Our experimental evaluations on four real-world datasets demonstrate the effectiveness and efficiency of our proposed algorithm and techniques.
Points-of-Interest Relationship Inference with Spatial-enriched Graph Neural Networks [Download Paper] [Scalable Data Science] Yile Chen (Nanyang Technological University)*, Xiucheng Li (Harbin Institute of Technology), Gao Cong (Nanyang Technological University), Cheng Long (Nanyang Technological University), Zhifeng Bao (RMIT University), Shang Liu (Nanyang Technological University), Wanli Gu (MEITUAN), Fuzheng Zhang (Meituan-Dianping Group) As a fundamental component in location-based services, inferring the relationship between points-of-interest (POIs) is critical for service providers to offer good user experience to business owners and customers. Most of the existing methods for relationship inference are not targeted at POIs, thus failing to capture unique spatial characteristics that have huge effects on POI relationships. In this work, we propose PRIM to tackle POI relationship inference for multiple relation types. PRIM features four novel components, including a weighted relational graph neural network, category taxonomy integration, a self-attentive spatial context extractor, and a distance-specific scoring function. Extensive experiments on two real-world datasets show that PRIM achieves the best results compared to state-of-the-art baselines, and that it is robust against data sparsity and is applicable to unseen cases in practice.
METRO: A Generic Graph Neural Network Framework for Multivariate Time Series Forecasting [Download Paper] Yue Cui (The Hong Kong University of Science and Technology)*, Kai Zheng (University of Electronic Science and Technology of China), Dingshan Cui (Sichuan University), Jiandong Xie (HUAWEI TECHNOLOGIES CO.LTD.), Liwei Deng (University of Electronic Science and Technology of China), Feiteng Huang (Huawei Cloud Database Innovation Lab), Xiaofang Zhou (The Hong Kong University of Science and Technology) Multivariate time series forecasting has been drawing increasing attention due to its prevalent applications. It has been commonly assumed that leveraging latent dependencies between pairs of variables can enhance the prediction accuracy. However, most existing methods suffer from static variable relevance modeling and ignorance of correlation between temporal scales, thereby failing to fully retain the dynamic and periodic interdependencies among variables, which are vital for long- and short-term forecasting. In this paper, we propose METRO, a generic framework with multi-scale temporal graph neural networks, which models the dynamic and cross-scale variable correlations simultaneously. By representing the multivariate time series as a series of temporal graphs, both intra- and inter-step correlations can be well preserved via message-passing and node embedding update. To enable information propagation across temporal scales, we design a novel sampling strategy to align specific steps between higher and lower scales and fuse the cross-scale information efficiently. Moreover, we provide a modular interpretation of existing GNN-based time series forecasting works as specific instances under our framework. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness and efficiency of our approach. METRO has been successfully deployed on the time series analytics platform of Huawei Cloud, where a 1-month online test demonstrated that up to 20% relative improvement over state-of-the-art models w.r.t. RSE can be achieved.
TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data [Download Paper] Shreshth Tuli (Imperial College London)*, Giuliano Casale (Imperial College London), Nicholas R Jennings (Loughborough University) Efficient anomaly detection and diagnosis in multivariate time-series data is of great importance for modern industrial applications. However, building a system that is able to quickly and accurately pinpoint anomalous observations is a challenging problem. This is due to the lack of anomaly labels, high data volatility and the demands of ultra-low inference times in modern applications. Despite the recent developments of deep learning approaches for anomaly detection, only a few of them can address all of these challenges. In this paper, we propose TranAD, a deep transformer network based anomaly detection and diagnosis model which uses attention-based sequence encoders to swiftly perform inference with the knowledge of the broader temporal trends in the data. TranAD uses focus score-based self-conditioning to enable robust multi-modal feature extraction and adversarial training to gain stability. Additionally, model-agnostic meta learning (MAML) allows us to train the model using limited data. Extensive empirical studies on six publicly available datasets demonstrate that TranAD can outperform state-of-the-art baseline methods in detection and diagnosis performance with data and time-efficient training. Specifically, TranAD increases F1 scores by up to 17%, reducing training times by up to 99% compared to the baselines.
Database Workload Characterization with Query Plan Encoders [Download Paper] Debjyoti Paul (University of Utah)*, Jie Cao (University of Utah), Feifei Li (University of Utah), Vivek Srikumar (University of Utah) Smart databases are adopting artificial intelligence (AI) technologies to achieve instance optimality, and in the future, databases will come with prepackaged AI models within their core components. The reason is that every database runs on different workloads and demands specific resources and settings to achieve optimal performance. This prompts the need to comprehensively understand the workloads running in the system along with their features, which we dub workload characterization. To address this workload characterization problem, we propose our query plan encoders that learn essential features and their correlations from query plans. Our pretrained encoders capture the structural properties and the computational performance of queries independently. We show that our pretrained encoders are adaptable to workloads that expedite the transfer learning process. We performed independent assessments of the structural and performance encoders with multiple downstream tasks. For the overall evaluation of our query plan encoders, we architect two downstream tasks: (i) query latency prediction and (ii) query classification. These tasks show the importance of feature-based workload characterization. We also performed extensive experiments on individual encoders to verify the effectiveness of representation learning and domain adaptability.
Selective Data Acquisition in the Wild for Model Charging [Download Paper] Chengliang Chai (Tsinghua University), Jiabin Liu (Tsinghua University), Nan Tang (Qatar Computing Research Institute, HBKU), Guoliang Li (Tsinghua University)*, Yuyu Luo (Tsinghua University) The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data instances from the data in the wild as additional training data that can help the ML task. It consists of two steps. The first step is to discover relevant datasets (e.g., tables with similar relational schemas), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data instances they contain can help. The second step is to select which data instances from these candidate datasets should be used. We build an end-to-end solution to solve this problem. For step 1, we piggyback on off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called Dataselect. It first clusters all data instances from candidate datasets such that each cluster contains similar data instances from different sources. It then iteratively picks which cluster to use, samples data instances (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit based solution and a Deep Q Networks-based reinforcement learning solution. Experiments using both relational and image datasets show that our methods outperform baselines for selecting data instances from candidate datasets obtained from multiple sources, including using the entire candidate datasets, selecting only similar data instances, active learning-based methods, and using coresets.
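The cluster-selection step can be pictured as a standard multi-armed bandit. The UCB1 loop below uses a placeholder reward in place of the paper's mini-batch evaluation, and the per-cluster "usefulness" values are made up for illustration; the Deep Q-Network variant is not shown.

```python
# UCB1-style cluster selection with a stand-in reward function.
import math
import random

random.seed(0)
clusters = {0: 0.8, 1: 0.3, 2: 0.55}          # hidden, assumed "usefulness"

def evaluate_minibatch(cluster_id):
    # placeholder for evaluating a sampled mini-batch on the downstream model
    return max(0.0, min(1.0, random.gauss(clusters[cluster_id], 0.1)))

counts = {c: 0 for c in clusters}
totals = {c: 0.0 for c in clusters}

for t in range(1, 201):
    def ucb(c):                                 # mean reward + exploration bonus
        if counts[c] == 0:
            return float("inf")
        return totals[c] / counts[c] + math.sqrt(2 * math.log(t) / counts[c])
    chosen = max(clusters, key=ucb)
    reward = evaluate_minibatch(chosen)
    counts[chosen] += 1
    totals[chosen] += reward

print({c: round(counts[c] / 200, 2) for c in clusters})   # selection frequencies
```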
A Scalable AutoML Approach Based on Graph Neural Networks [Download Paper] Mossad Helali (Concordia University)*, Essam Mansour (Concordia University), Ibrahim Abdelaziz (IBM Research), Julian Dolby (IBM Research), Kavitha Srinivas (IBM Research) AutoML systems build machine learning models automatically by performing a search over valid data transformations and learners, along with hyper-parameter optimization for each learner. Many AutoML systems use meta-learning to guide the search for optimal pipelines. In this work, we present a novel meta-learning system called KGpip which (1) builds a database of datasets and corresponding pipelines by mining thousands of scripts with program analysis, (2) uses dataset embeddings to find similar datasets in the database based on their content instead of metadata-based features, (3) models AutoML pipeline creation as a graph generation problem to succinctly characterize the diverse pipelines seen for a single dataset. KGpip's meta-learning is designed as a sub-component for AutoML systems. We demonstrate this by integrating KGpip with two AutoML systems. Our comprehensive evaluation using 121 datasets, including those used by the state-of-the-art systems, shows that KGpip significantly outperforms these systems.
SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis [Download Paper] Teng Zhang (Alibaba Group)*, Jian Tan (Alibaba), Xin Cai (Alibaba), Jianying Wang (Alibaba Inc.), Feifei Li (Alibaba Group), Jianling Sun (Zhejiang University) A significant fraction of data in cloud storage is rarely accessed, referred to as cold data. Accurately identifying and efficiently managing cold data on cost-effective storage is one of the major challenges for cloud providers, as it balances reducing the cost against improving the system performance. To this end, we propose SA-LSM to use (S)urvival (A)nalysis for Log-Structured Merge Tree (LSM-tree) key-value (KV) stores. Conventionally, the data layout of LSM-tree is determined jointly by the write and the compaction operations. However, this process by default does not fully utilize the access information of data records, leading to a suboptimal data layout that negatively impacts the system performance. SA-LSM utilizes survival analysis, a statistical learning technique commonly used in biostatistics, to optimize the data layout. When applied to LSM-trees with proper adaptations, SA-LSM can accurately predict cold data using the historical semantic information and access traces. As a concrete realization, we implement our proposal in X-Engine, a commercial-strength open-source LSM-tree storage engine. To make the deployment more flexible, we also design a non-intrusive architecture that offloads CPU-intensive work, e.g., model training and inference, to an external service. Extensive experiments on real-world workloads show that it can decrease the tail latency by up to 78.9% compared to the state-of-the-art techniques. The generality of this approach and the significant performance improvement show great potential in a variety of related applications.
Projected Federated Averaging with Heterogeneous Differential Privacy [Download Paper] Junxu Liu (Renmin University of China)*, Jian Lou (Emory University), Li Xiong (Emory University), Jinfei Liu (Zhejiang University), Xiaofeng Meng (Renmin University of China) Cross-silo Federated Learning (FL) emerges as a promising framework for multiple institutions to collaboratively learn a joint model without directly sharing the data. In addition to high utility of the joint model, rigorous protection of the sensitive data and communication efficiency are among the key design desiderata of a successful FL algorithm. Many existing efforts achieve rigorous privacy by ensuring differential privacy for the intermediate model parameters; however, they typically assume a uniform privacy parameter for all the sites. In practice, different institutions may have different privacy requirements due to varying privacy policies or preferences of the data subjects. In this paper, we focus on explicitly modeling and leveraging the heterogeneous privacy requirements of different institutions. We formalize it as the heterogeneous differentially private federated learning problem and study how to optimize utility for the joint model while minimizing communication cost. As differentially private perturbations inevitably affect the model utility, a natural idea is to make better use of information submitted by the institutions with higher privacy budgets (referred to as "public" clients, and the opposite are "private" clients). The challenge is how to use such information without biasing the global model. To this end, we propose the Projected Federated Averaging with heterogeneous differential privacy, named PFA, which extracts the top singular subspace of the model updates submitted by "public" clients and then utilizes them to project the model updates of "private" clients before aggregating them. We further propose the communication-efficient PFA+, which allows "private" clients to upload projected parameters instead of original parameters using the projection space learned from the previous round. Our experiments on both statistical learning and deep learning verify the utility boost of both algorithms compared to the baseline methods, whereby PFA+ achieves over 99% uplink communication reduction for "private" clients. Our implementation is publicly available.
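A minimal sketch of the projection idea, under assumed toy dimensions and noise levels: take the top singular subspace of the public clients' updates via an SVD and project the private clients' noisier updates onto it before averaging. The differential-privacy noise calibration and the PFA+ communication optimization are omitted.

```python
# Project "private" client updates onto the top singular subspace of the
# "public" client updates before aggregation (illustrative dimensions/noise).
import numpy as np

rng = np.random.default_rng(1)
d = 100                                         # model dimension
public_updates = rng.normal(size=(4, d))        # lightly perturbed updates
private_updates = rng.normal(size=(6, d)) * 5.0 # heavily perturbed updates

k = 3                                           # subspace rank (assumption)
_, _, vt = np.linalg.svd(public_updates, full_matrices=False)
basis = vt[:k]                                  # shape (k, d)

# Suppress noise components orthogonal to directions the public clients agree on.
projected_private = private_updates @ basis.T @ basis

aggregated = np.mean(np.vstack([public_updates, projected_private]), axis=0)
print(aggregated.shape)
```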
Effective Community Search over Large Star-Schema Heterogeneous Information Networks [Download Paper] Yangqin Jiang (The Chinese University of Hong Kong, Shenzhen)*, Yixiang Fang (School of Data Science, The Chinese University of Hong Kong, Shenzhen), Chenhao Ma (The University of Hong Kong), Xin Cao (University of New South Wales), Chunshan Li (Harbin Institute of Technology) Community search (CS) enables personalized community discovery and has found a wide spectrum of emerging applications such as setting up social events and friend recommendation. While CS has been extensively studied for conventional homogeneous networks, the problem for heterogeneous information networks (HINs) has received attention only recently. However, existing studies suffer from several limitations, e.g., they either require users to specify a meta-path or relational constraints, which pose great challenges to users who are not familiar with HINs. To address these limitations, in this paper, we systematically study the problem of CS over large star-schema HINs without asking users to specify these constraints; that is, given a set Q of query vertices with the same type, find the most-likely community from a star-schema HIN containing Q, in which all the vertices have the same type and close relationships. To capture the close relationships among vertices of the community, we employ the meta-path-based core model, and maximize the number of shared meta-paths such that each of them results in a cohesive core containing Q. To enable efficient CS, we first develop online algorithms via exploiting the anti-monotonicity property of shared meta-paths. We further boost the efficiency by proposing a novel index and an efficient index-based algorithm with elegant pruning techniques. Extensive experiments on four real large star-schema HINs show that our solutions are effective and efficient for searching communities, and the index-based algorithm is much faster than the online algorithms.
An I/O-Efficient Disk-based Graph System for Scalable Second-Order Random Walk of Large Graphs [Download Paper] Hongzheng Li (Beijing University of Posts and Telecommunications), Yingxia Shao (BUPT)*, Junping Du (Beijing University of Posts and Telecommunications), Bin Cui (Peking University), Lei Chen (Hong Kong University of Science and Technology) Random walk is widely used in many graph analysis tasks, especially the first-order random walk. However, as a simplification of real-world problems, the first-order random walk is poor at modeling higher-order structures in the data. Recently, second-order random walk-based applications (e.g., Node2vec, Second-order PageRank) have become attractive. Due to the complexity of the second-order random walk models and memory limitation, it is not scalable to run second-order random walk-based applications in a single machine. Existing disk-based graph systems are only friendly to the first-order random walk models and suffer from expensive disk I/Os when executing the second-order random walks. This paper introduces an I/O-efficient disk-based graph system for the scalable second-order random walk of large graphs, called GraSorw. First, to eliminate massive light vertex I/Os, we develop a bi-block execution engine that converts random I/Os into sequential I/Os by applying a new triangular bi-block scheduling strategy, the bucket-based walk management, and the skewed walk storage. Second, to improve the I/O utilization, we design a learning-based block loading model to leverage the advantages of the full-load and on-demand load methods. Finally, we conducted extensive experiments on five large datasets. The empirical results demonstrate that the end-to-end time cost of popular tasks in GraSorw is reduced by more than one order of magnitude compared to the existing disk-based graph systems.
RapidFlow: An Efficient Approach to Continuous Subgraph Matching [Download Paper] Shixuan Sun (National University of Singapore)*, Xibo Sun (Hong Kong University of Science and Technology), Bingsheng He (National University of Singapore), Qiong Luo (Hong Kong University of Science and Technology) Continuous subgraph matching (CSM) is an important building block in many real-time graph processing applications. Given a subgraph query Q and a data graph stream, a CSM algorithm reports the occurrences of Q in the stream. Specifically, when a new edge e arrives in the stream, existing CSM algorithms start from the inserted e in the current data graph G to search Q. However, this rigid matching order of always starting from e can lead to a massive number of partial results that will turn out futile. Also, if Q contains automorphisms, there will be a lot of redundant computation in the matching process. To address these two problems, we propose RapidFlow, an effective approach to CSM. First, we design a query reduction technique, which reduces CSM to batch subgraph matching (BSM) where we enumerate all results in a region of G that will be affected by the update. The well-established BSM techniques can determine effective matching orders, not necessarily starting from the newly inserted edge. Second, to eliminate redundant computation caused by automorphisms in Q, we propose dual matching, which leverages the duality of Q and G in the matching process. Extensive experiment results show that RapidFlow outperforms state-of-the-art algorithms, including TurboFlux and SymBi, by up to two orders of magnitude on various workloads.
Finding Locally Densest Subgraphs: A Convex Programming Approach [Download Paper] Chenhao Ma (The University of Hong Kong)*, Reynold Cheng ("The University of Hong Kong, China"), Laks V.s. Lakshmanan (The University of British Columbia), Xiaolin Han (The University of Hong Kong) Finding the densest subgraph (DS) from a graph is a fundamental problem in graph databases. The DS obtained, which reveals closely related entities, has been found to be useful in various application domains such as e-commerce, social science, and biology. However, in a big graph that contains billions of edges, it is desirable to find more than one subgraph cluster that is not necessarily the densest yet still reveals closely related vertices. In this paper, we study the locally densest subgraph (LDS), a recently-proposed variant of DS. An LDS is a subgraph which is the densest among its "local neighbors". Given a graph G, a number of LDS's can be returned, which reflect different dense regions of G and thus give more information than DS. The existing LDS solution suffers from low efficiency. We thus develop a convex-programming-based solution that enables powerful pruning. Extensive experiments on seven real large graph datasets show that our proposed algorithm is up to four orders of magnitude faster than the state-of-the-art.
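For context on the density objective only, the snippet below is the classic greedy peeling 2-approximation for the global densest subgraph, a standard baseline; it is not the convex-programming LDS algorithm proposed in the paper.

```python
# Charikar-style greedy peeling: repeatedly remove a minimum-degree vertex
# and keep the densest intermediate subgraph (density = edges / vertices).
from collections import defaultdict

def densest_subgraph_peel(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = len(edges)
    best_density, best_nodes = 0.0, set(nodes)
    while nodes:
        density = m / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        v = min(nodes, key=lambda x: len(adj[x]))    # peel a min-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])
        adj.pop(v)
        nodes.remove(v)
    return best_density, best_nodes

print(densest_subgraph_peel([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]))
```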
Efficient Label-Constrained Shortest Path Queries on Road Networks: A Tree Decomposition Approach [Download Paper] Junhua Zhang (UTS), Long Yuan (Nanjing University of Science and Technology)*, Wentao Li (University of Technology Sydney), Lu Qin (UTS), Ying Zhang (University of Technology Sydney) Computing the shortest path between two vertices is a fundamental problem in road networks. Most of the existing works assume that the edges in the road networks have no labels, but in many real applications, the edges have labels and label constraints may be placed on the edges appearing on a valid shortest path. Hence, we study the label-constrained shortest path queries in this paper. In order to process such queries efficiently, we adopt an index-based approach and propose a novel index structure, LSD-Index, based on tree decomposition. With LSD-Index, we design an efficient query processing algorithm with good performance guarantees. Moreover, we also propose an algorithm to construct LSD-Index and further improve the efficiency of index construction by exploiting parallel computing techniques. We conduct extensive performance studies using large real road networks including the whole USA road network. Compared with the state-of-the-art approach, the experimental results demonstrate that our algorithm not only achieves up to 2 orders of magnitude speedup in query processing time but also consumes much less index space. Meanwhile, the indexing time is also competitive, especially for the parallel index construction algorithm.
(p,q)-Biclique Counting and Enumeration for Large Sparse Bipartite Graphs [Download Paper] Jianye Yang (Guangzhou University)*, Yun Peng (Guangzhou University), Wenjie Zhang (University of New South Wales) In this paper, we study the problem of (p,q)-biclique counting and enumeration for large sparse bipartite graphs. Given a bipartite graph G = (U, V, E), and two integer parameters p and q, we aim to efficiently count and enumerate all (p,q)-bicliques in G, where a (p,q)-biclique B(L,R) is a complete subgraph of G with L ⊆ U, R ⊆ V, |L| = p, and |R| = q. The problem of (p,q)-biclique counting and enumeration has many applications, such as graph neural network information aggregating, densest subgraph detection, and cohesive subgroup analysis, etc. Despite the wide range of applications, to the best of our knowledge, we note that there is no efficient and scalable solution to this problem in the literature. This problem is computationally challenging, due to the worst-case exponential number of (p,q)-bicliques. In this paper, we propose a competitive branch-and-bound baseline method, namely BCList, which explores the search space in a depth-first manner, together with a variety of pruning techniques. Although BCList offers a useful computation framework to our problem, its worst-case time complexity is exponential in p+q. To alleviate this, we propose an advanced approach, called BCList++. Particularly, BCList++ applies a layer based exploring strategy to enumerate (p,q)-bicliques by anchoring the search on either U or V only, which has a worst-case time complexity exponential in either p or q only. Consequently, a vital task is to choose a layer with the least computation cost. To this end, we develop a cost model, which is built upon an unbiased estimator for the density of the 2-hop graph induced by U or V. To improve computation efficiency, BCList++ exploits pre-allocated arrays and vertex labeling techniques such that the frequent subgraph creating operations can be substituted by array element switching operations. We conduct extensive experiments on 16 real-life datasets, and the experimental results demonstrate that BCList++ significantly outperforms the baseline methods by up to 3 orders of magnitude. We show via a case study that (p,q)-bicliques optimize the efficiency of graph neural networks.
Banyan: A Scoped Dataflow Engine for Graph Query Service [Download Paper] Li Su (Alibaba Group)*, Xiaoming Qin (Alibaba Group), Zichao Zhang (Alibaba Group), Rui Yang (University of Illinois Urbana-Champaign), Le Xu (UIUC), Indranil Gupta (UIUC), Wenyuan Yu (Alibaba Group), Zeng Kai (Alibaba Group), Jingren Zhou (Alibaba Group) Graph query services (GQS) are widely used today to interactively answer graph traversal queries on large-scale graph data. Existing graph query engines focus largely on optimizing the latency of a single query. This ignores significant challenges posed by GQS, including fine-grained control and scheduling during query execution, as well as performance isolation and load balancing at various levels, from across-user to intra-query. To tackle these control and scheduling challenges, we propose a novel scoped dataflow for modeling graph traversal queries, which explicitly exposes concurrent execution and control of any subquery to the finest granularity. We implemented Banyan, an engine based on the scoped dataflow model for GQS. Banyan focuses on scaling up the performance on a single machine, and provides the ability to easily scale out. Extensive experiments on multiple benchmarks show that Banyan improves performance by up to three orders of magnitude over state-of-the-art graph query engines, while providing performance isolation and load balancing.
Effective Indexing for Dynamic Structural Graph Clustering [Download Paper] Fangyuan Zhang (The Chinese University of Hong Kong), Sibo Wang (The Chinese University of Hong Kong)* Graph clustering is a fundamental data mining task that clusters vertices into different groups. The structural graph clustering algorithm (SCAN) is a widely used graph clustering algorithm that derives not only clustering results, but also special roles of vertices like hubs and outliers. In this paper, we consider structural graph clustering under Jaccard similarity on dynamic graphs. The state-of-the-art index-based solution focuses on static graphs and takes prohibitive update costs to maintain the index. Recently, an approximate dynamic structural graph clustering algorithm under Jaccard similarity has been proposed, reducing the expected update cost to O(log² n + log n · log(M/p_f)), which guarantees that the returned clustering result satisfies the approximation definition with probability 1 - p_f after up to M updates. However, their solution needs to fix the input parameters while the parameter settings of SCAN usually need to be fine-tuned to achieve good clustering results. Motivated by these limitations, we present a study on devising effective index structures for the SCAN algorithm on dynamic graphs. Similar to the state-of-the-art dynamic scheme, our main idea to reduce the time complexity is still by bringing approximation to clustering results. However, our solution does not need to fix the input parameters. To achieve this, our solution includes two key components. The first is to maintain a bottom-k sketch for each vertex so that the similarities of affected vertices can be easily updated. The second key is a bucketing strategy that allows us to update clustering results and roles of vertices efficiently. Our theoretical analysis shows that our proposed algorithm achieves O(log n · log((M+m)/p_f)) expected update cost and guarantees to return an approximate clustering result with probability 1 - p_f after up to M updates. Extensive experiments show that our solution is up to two orders of magnitude faster than the state-of-the-art index-based solution while still achieving high-quality clustering results.
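A minimal bottom-k sketch for estimating the Jaccard similarity of two neighborhoods, the kind of per-vertex summary that lets similarities of affected vertices be refreshed cheaply after an update. The hash function, k, and the toy neighborhoods are illustrative assumptions, not the paper's exact construction.

```python
# Bottom-k sketches and a Jaccard estimate from the merged sketch.
import hashlib

def h(x):
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def bottom_k(items, k=4):
    return sorted(h(x) for x in items)[:k]

def jaccard_estimate(sk_a, sk_b, k=4):
    union_sketch = sorted(set(sk_a) | set(sk_b))[:k]   # bottom-k of the union
    shared = set(sk_a) & set(sk_b)
    return sum(1 for v in union_sketch if v in shared) / len(union_sketch)

# SCAN-style structural similarity compares closed neighborhoods N[v].
neighborhood_u = {"u", "a", "b", "c", "d"}
neighborhood_v = {"v", "a", "b", "c", "e"}
print(jaccard_estimate(bottom_k(neighborhood_u), bottom_k(neighborhood_v)))
```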
Diversified Top-k Route Planning in Road Network [Download Paper] Zihan Luo (The Hong Kong University of Science and Technology (Guang Zhou))*, Lei Li (The Hong Kong University of Science and Technology (Guang Zhou)), Mengxuan Zhang (Iowa State University), Wen Hua (The University of Queensland), Yehong Xu (The Hong Kong University of Science and Technology), Xiaofang Zhou (The Hong Kong University of Science and Technology) Route planning is ubiquitous and has a profound impact on our daily life. However, the existing path algorithms tend to produce similar paths between similar OD (Origin-Destination) pairs because they optimize query results without considering their influence on the whole network, which further introduces congestion. Therefore, we investigate the problem of diversifying the top-k paths between an OD pair such that their similarities are under a threshold while their total length is minimal. However, the current solutions all depend on expensive graph traversal, which is too slow to apply in practice. Therefore, we first propose an edge deviation and concatenation-based method to avoid the expensive graph search in path enumeration. After that, we dive into the path relations and propose a path similarity computation method with constant complexity, and propose a pruning technique to improve efficiency. Finally, we provide the completeness and efficiency-oriented solutions to further accelerate the query answering. Evaluations on real-life road networks demonstrate the effectiveness and efficiency of our algorithm over the state-of-the-art.
Fast Algorithms for Core Maximization on Large Graphs [Download Paper] Xin Sun (Tianjin University)*, Xin Huang (Hong Kong Baptist University), Di Jin (Tianjin University) Core maximization, which enlarges the k-core as much as possible by inserting a few new edges into a graph, is particularly useful to improve the dense substructure's stability in various real-life applications, such as the friendship-based communities in social networks, the topology extension in distributed networks, and the communication stability for resource exchanging in P2P networks. However, the core maximization problem has been theoretically proven to be NP-hard and even APX-hard for k ≥ 3. Existing heuristic approaches suffer from inefficiency and do not scale to large graphs. To address this limitation, in this paper, we revisit this challenging yet important problem of core maximization, that is, given a graph G, a number k, and a budget b, to insert b new edges into G such that the k-core subgraph is maximized. We propose a novel algorithm FCM based on three fast search strategies. The core idea is to apply graph partitioning to divide the (k-1)-shell into different components, and then consider each component independently to convert vertices at different levels into the k-core. The time complexity of FCM is theoretically analyzed to be O(mn + bhn), where n, m, and h are the numbers of vertices, edges, and (k-1)-shell subsets respectively and h ≪ n, which is clearly faster than the O(bmn²) of the SOTA algorithm EKC. Experimental results on eleven datasets demonstrate that our algorithm runs 27,000x faster than EKC on large graphs while achieving highly competitive quality results.
FACE: A Normalizing Flow based Cardinality Estimator [Download Paper] Jiayi Wang (Tsinghua University), Chengliang Chai (Tsinghua University), Jiabin Liu (Tsinghua University), Guoliang Li (Tsinghua University)* Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning based techniques have been proposed to effectively estimate cardinality, which can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model from a query to its cardinality; while data-driven approaches learn a distribution of tuples, select some samples that satisfy a SQL query, and use the data distributions of these selected tuples to estimate the cardinality of the SQL query. As query-driven methods rely on training queries, the estimation quality is not reliable when there are no high-quality training queries; while data-driven methods have no such limitation and have high adaptivity. In this work, we focus on data-driven methods. A good data-driven model should achieve three optimization goals. First, the model needs to capture data dependencies between columns and support large domain sizes (achieving high accuracy). Second, the model should achieve high inference efficiency, because many data samples are needed to estimate the cardinality (achieving low inference latency). Third, the model should not be too large (achieving a small model size). However, existing data-driven methods cannot simultaneously optimize the three goals. To address the limitations, we propose a novel cardinality estimator FACE, which leverages the Normalizing Flow based model to learn a continuous joint distribution for relational data. FACE can transform a complex distribution over continuous random variables into a simple distribution (e.g., multivariate normal distribution), and use the probability density to estimate the cardinality. First, we design a dequantization method to make data more "continuous". Second, we propose encoding and indexing techniques to handle Like predicates for string data. Third, we propose a Monte Carlo method to efficiently estimate the cardinality. Experimental results show that our method significantly outperforms existing approaches in terms of estimation accuracy while keeping similar latency and model size.
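The Monte Carlo step can be pictured as integrating a learned density over the predicate box. In the toy below, a fixed Gaussian stands in for the trained normalizing flow, and the predicate, sample budget, and table size are assumptions.

```python
# Estimate the cardinality of a range predicate by Monte Carlo integration of
# a joint density over the predicate box (a Gaussian stands in for the flow).
import numpy as np

rng = np.random.default_rng(0)

def density(x):
    # stand-in for the learned flow density: independent 2-D Gaussian
    mu, sigma = np.array([50.0, 100.0]), np.array([10.0, 25.0])
    z = (x - mu) / sigma
    return np.exp(-0.5 * np.sum(z * z, axis=1)) / np.prod(sigma * np.sqrt(2 * np.pi))

def estimate_cardinality(lo, hi, n_rows, n_samples=50_000):
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    samples = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    box_volume = np.prod(hi - lo)
    selectivity = density(samples).mean() * box_volume   # uniform-sampling MC
    return selectivity * n_rows

# e.g. SELECT count(*) WHERE 40 <= a <= 60 AND 75 <= b <= 125, over 1M rows
print(round(estimate_cardinality([40, 75], [60, 125], n_rows=1_000_000)))
```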
JENNER: Just-in-time Enrichment in Query Processing [Download Paper] Dhrubajyoti Ghosh (UC Irvine)*, Peeyush Gupta (UC Irvine), Sharad Mehrotra (U.C. Irvine), Roberto Yus (University of Maryland, Baltimore County), Yasser Altowim (King Abdulaziz City for Science and Technology) Emerging domains, such as sensor-driven smart spaces and social media analytics, require incoming data to be enriched prior to its use. Enrichment often consists of machine learning (ML) functions that are too expensive/infeasible to execute at ingestion. In this paper, we develop a strategy entitled Just-in-time ENrichmeNt in quERy Processing (JENNER) to support interactive analytics over data as soon as it arrives in such application contexts. JENNER exploits the inherent tradeoffs of cost and quality often displayed by the ML functions to progressively improve query answers during query execution. We describe how JENNER works for a large class of SPJ and aggregation queries that form the bulk of data analytics workloads. Our experimental results on real datasets (IoT and Tweet) show that JENNER achieves progressive answers and performs significantly better than the naive strategies of achieving progressive computation.
LEGOStore: A Linearizable Geo-Distributed Store Combining Replication and Erasure Coding [Download Paper] Hamidreza Zare (Pennsylvania State University)*, Viveck Cadambe (Pennsylvania State University), Bhuvan Urgaonkar (Penn State), Nader Alfares (Penn State), Praneet Soni (Pennsylvania State University), Chetan Sharma (Penn State University), Arif Merchant (Google Inc.) We design and implement LEGOStore, an erasure coding (EC) based linearizable data store over geo-distributed public cloud data centers (DCs). For such a data store, the confluence of the following factors opens up opportunities for EC to be latency-competitive with replication: (a) the necessity of communicating with remote DCs to tolerate entire DC failures and implement linearizability; and (b) the emergence of DCs near most large population centers. LEGOStore employs an optimization framework that, for a given object, carefully chooses among replication and EC, as well as among various DC placements to minimize overall costs. To handle workload dynamism, LEGOStore employs a novel agile reconfiguration protocol. Our evaluation using a LEGOStore prototype spanning 9 Google Cloud Platform DCs demonstrates the efficacy of our ideas. We observe cost savings ranging from moderate (5-20%) to significant (60%) over baselines representing the state of the art while meeting tail latency SLOs. Our reconfiguration protocol is able to transition key placements in 3 to 4 inter-DC RTTs (<1s in our experiments), allowing for agile adaptation to dynamic conditions.
Velox: Meta's Unified Execution Engine [Industry] [Download Paper] Pedro Pedreira (Meta Platforms Inc.)*, Orri Erling (Facebook), Maria Basmanova (Facebook), Kevin Wilfong (Facebook), Laith s Sakka (Meta), Krishna Pai (Meta), Wei He (Meta Platforms, Inc.), Biswapesh Chattopadhyay (Facebook) The ad-hoc development of new specialized computation engines targeted to very specific data workloads has created a siloed data landscape. Commonly, these engines share little to nothing with each other and are hard to maintain, evolve, and optimize, and ultimately provide an inconsistent experience to data users. In order to address these issues, Meta has created Velox, a novel open source C++ database acceleration library. Velox provides reusable, extensible, high-performance, and dialect-agnostic data processing components for building execution engines, and enhancing data management systems. The library heavily relies on vectorization and adaptivity, and is designed from the ground up to support efficient computation over complex data types due to their ubiquity in modern workloads. Velox is currently integrated or being integrated with more than a dozen data systems at Meta, including analytical query engines such as Presto and Spark, stream processing platforms, message buses and data warehouse ingestion infrastructure, machine learning systems for feature engineering and data preprocessing (PyTorch), and more. It provides benefits in terms of (a) efficiency wins by democratizing optimizations previously only found in individual engines, (b) increased consistency for data users, and (c) engineering efficiency by promoting reusability.
Fast Neural Ranking on Bipartite Graph Indices [Download Paper] [Scalable Data Science] Shulong Tan (Baidu Research)*, Weijie Zhao (Rochester Institute of Technology), Ping Li (Baidu) Neural network based ranking is widely adopted due to its powerful capacity in modeling complex relationships, such as between users and items, questions and answers. Online neural network ranking, so-called fast neural ranking, is considered challenging because neural network measures are usually non-convex and asymmetric. Traditional Approximate Nearest Neighbor (ANN) search, which usually focuses on metric ranking measures, is not applicable to these advanced measures. In this paper, we propose to construct BipartitE Graph INdices (BEGIN) for fast neural ranking. BEGIN contains two types of nodes: base/searching objects and sampled queries. The edges connecting these types of nodes are constructed via the neural network ranking measure. The proposed algorithm is a natural extension of traditional search-on-graph methods and is more suitable for fast neural ranking. Experiments demonstrate the effectiveness and efficiency of the proposed method.
Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing [Download Paper] Chenghao Lyu (University of Massachusetts Amherst)*, Qi Fan (Ecole Polytechnique), Fei Song (Ecole Polytechnique), Arnab Sinha (Ecole Polytechnique), Yanlei Diao (Ecole Polytechnique), Wei Chen (Alibaba), Li Ma (Alibaba Group), Yihui Feng (Alibaba Group), Yaliang Li (Alibaba Group), Kai Zeng (Alibaba Group), Jingren Zhou (Alibaba Group) Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, placement of parallel instances on machines, and resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute based integrated system to support multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level recommendations in a hierarchical MOO framework. Evaluation using production workloads shows that our new RO system can simultaneously reduce latency by 37-72% and cost by 43-78%, compared to the current optimizer and scheduler, while running in 0.02-0.23s.
Enabling Personal Consent in Databases [Download Paper] George Konstantinidis (University of Southampton)*, Jet Holt (University of Southampton), Adriane Chapman (University of Southampton) Users have the right to consent to the use of their data, but current methods are limited to very coarse-grained expressions of consent, such as "opt-in/opt-out" choices for certain uses. In this paper, we identify the need for fine-grained consent management and formalize how to express and manage user consent and personal contracts of data usage in relational databases. Unlike privacy approaches, our focus is not on preserving confidentiality against an adversary, but rather to cooperate with a trusted service provider to abide by user preferences in an algorithmic way. Our approach enables data owners to express the intended data usage in formal specifications, which we call consent constraints, and enables a service provider that wants to honor these constraints to automatically do so by filtering query results that violate consent, rather than both sides relying on "terms of use" agreements written in natural language. We provide formal foundations (based on provenance), algorithms (based on unification and query rewriting), connections to data privacy, and complexity results for supporting consent in databases. We implement our framework in an open source RDBMS, and provide an evaluation against the most relevant privacy approach using the TPC-H benchmark, and on a real dataset of ICU data.
Scabbard: Single-Node Fault-Tolerant Stream Processing [Download Paper] [Research] Georgios R Theodorakis (Imperial College London)*, Fotios Kounelis (Imperial College London), Peter Pietzuch (Imperial College London), Holger Pirk (Imperial College, UK) Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, necessitating complex cluster-based deployments. This lack of built-in fault-tolerance features has hindered the adoption of single-node SPEs. We describe Scabbard, the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth. Scabbard achieves this by integrating persistence operations with the query workload. Within the operator graph, Scabbard determines when to persist streams based on the selectivity of operators: by persisting streams after operators that discard data, it can substantially reduce the required I/O bandwidth. As part of the operator graph, Scabbard supports parallel persistence operations and uses markers to decide when to discard persisted data. The persisted data volume is further reduced using workload-specific compression: Scabbard monitors stream statistics and dynamically generates computationally efficient compression operators. Our experiments show that Scabbard can execute stream queries that process over 200 million tuples per second while recovering from failures with sub-second latencies.
Chimp: Efficient Lossless Floating Point Compression for Time Series Databases [Download Paper] Panagiotis Liakos (University of Athens)*, Katia Papakonstantinopoulou (Athens University of Economics and Business), Yannis Kotidis (Athens University of Economics and Business) Applications in diverse domains, such as astronomy, economics and industrial monitoring, increasingly press the need for analyzing massive collections of time series data. The sheer size of the latter hinders our ability to efficiently store them and also yields significant storage costs. Applying general purpose compression algorithms would effectively reduce the size of the data, at the expense of introducing significant computational overhead. Time Series Management Systems, which have emerged to address the challenge of handling this overwhelming amount of information, cannot tolerate the ingestion rate restrictions that such compression algorithms would cause. Data points are usually encoded using faster, streaming compression approaches. However, the techniques that contemporary systems use do not fully utilize the compression potential of time series data, with implications in both storage requirements and access times. In this work, we propose a novel streaming compression algorithm, suitable for floating point time series data. We empirically establish properties exhibited by a diverse set of time series and harness these features in our proposed encodings. Our experimental evaluation demonstrates that our approach readily outperforms competing techniques, attaining compression ratios that are competitive with slower general purpose algorithms, and on average around 50% of the space required by state-of-the-art streaming approaches. Moreover, our algorithm outperforms all earlier techniques with regard to both compression and access time, offering a significantly improved trade-off between space and speed. The aforementioned benefits of our approach (in terms of space requirements, compression time, and read access) significantly improve the efficiency with which we can store and analyze time series data.
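Chimp belongs to the XOR-based streaming family popularized by Gorilla. The toy encoder below only reports, per value, the XOR with its predecessor and the leading/trailing zero counts of that XOR; it does not reproduce Chimp's actual flag bits or bit layout.

```python
# XOR a double with its predecessor and inspect the zero-bit structure that
# XOR-based float codecs exploit (illustrative only).
import struct

def to_bits(x):
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    prev = to_bits(values[0])
    out = [("first", values[0])]
    for v in values[1:]:
        cur = to_bits(v)
        x = prev ^ cur
        if x == 0:
            out.append(("same", 0, 0))               # identical value
        else:
            leading = 64 - x.bit_length()            # leading zero bits of XOR
            trailing = (x & -x).bit_length() - 1     # trailing zero bits of XOR
            out.append(("diff", leading, trailing))
        prev = cur
    return out

for entry in xor_stream([15.5, 15.5, 15.625, 15.75, 16.0]):
    print(entry)
```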
Provenance-based Data Skipping [Download Paper] Xing Niu (Illinois Institute of Technology)*, Boris Glavic (Illinois Institute of Technology), Ziyu Liu (Illinois institute of thechnology), Pengyuan Li (Illinois institute of thechnology), Dieter Gawlick (Oracle), Vasudha Krishnaswamy (Oracle, USA), Zhen Hua Liu (Oracle), Danica Porobic (Oracle) Database systems use static analysis to determine upfront which data is needed for answering a query and use indexes and other physical design techniques to speed-up access to that data. However, for important classes of queries, e.g., HAVING and top-k queries, it is impossible to determine up-front what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps.
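A toy rendering of the skipping idea, assuming a hypothetical range-fragmented table and query: run the query once, remember which fragments contributed to the answer as a sketch, and scan only those fragments on a later compatible run. The real system captures sketches via provenance and combines them with indexes and zone maps.

```python
# Capture which fragments contributed to a top-k answer, then reuse that
# "sketch" to skip fragments on a rerun (illustrative fragmenting and query).
rows = [{"id": i, "price": (i * 37) % 500} for i in range(10_000)]
FRAGMENT_SIZE = 1_000
fragments = [rows[i:i + FRAGMENT_SIZE] for i in range(0, len(rows), FRAGMENT_SIZE)]

def top_prices(frags):
    scanned = [r for frag in frags for r in frag]
    return sorted(scanned, key=lambda r: r["price"], reverse=True)[:10]

# Capture phase: record the fragments that produced a top-10 row.
answer = top_prices(fragments)
answer_ids = {r["id"] for r in answer}
sketch = {i for i, frag in enumerate(fragments)
          if any(r["id"] in answer_ids for r in frag)}

# Reuse phase: a subsequent run touches only the sketched fragments.
rerun = top_prices([fragments[i] for i in sorted(sketch)])
print(len(sketch), "of", len(fragments), "fragments scanned;", rerun == answer)
```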
Efficient Temporal Pattern Mining in Big Time Series Using Mutual Information [Download Paper] Van Long Ho (Aalborg University)*, Nguyen Ho (Aalborg University), Torben Bach Pedersen (Aalborg University) Very large time series are increasingly available from an ever wider range of IoT-enabled sensors deployed in different environments. Significant insights can be gained by mining temporal patterns from these time series. Unlike traditional pattern mining, temporal pattern mining (TPM) adds event time intervals into extracted patterns, making them more expressive at the expense of increased mining time complexity. Existing TPM methods either cannot scale to large datasets, or work only on pre-processed temporal events rather than on time series. This paper presents our Frequent Temporal Pattern Mining from Time Series (FTPMfTS) approach which provides: (1) The end-to-end FTPMfTS process taking time series as input and producing frequent temporal patterns as output. (2) The efficient Hierarchical Temporal Pattern Graph Mining (HTPGM) algorithm that uses efficient data structures for fast support and confidence computation, and employs effective pruning techniques for significantly faster mining. (3) An approximate version of HTPGM that uses mutual information, a measure of data correlation known from information theory, to prune unpromising time series from the search space. (4) An extensive experimental evaluation showing that HTPGM outperforms the baselines in runtime and memory consumption, and can scale to big datasets. The approximate HTPGM is up to two orders of magnitude faster and less memory consuming than the baselines, while retaining high accuracy.
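For reference, the approximate HTPGM prunes time series using mutual information; the standard definition for two discrete variables is shown below (the paper's exact estimator over temporal events may differ).

```latex
% Standard mutual information between discrete random variables X and Y:
% high I(X;Y) indicates correlated series worth keeping in the search space.
I(X;Y) \;=\; \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```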
Babelfish: Efficient Execution of Polyglot Queries [Download Paper] Philipp M Grulich (Technische Universität Berlin)*, Steffen Zeuch (DFKI Berlin), Volker Markl (Technische Universität Berlin) Today's users of data processing systems come from different domains, have different levels of expertise, and prefer different programming languages. As a result, analytical workload requirements shifted from relational to polyglot queries involving user-defined functions (UDFs). Although some data processing systems support polyglot queries, they often embed third-party language runtimes. This embedding induces a high performance overhead, as it causes additional data materialization between execution engines. In this paper, we present Babelfish, a novel data processing engine designed for polyglot queries. Babelfish introduces an intermediate representation that unifies queries from different implementation languages. This enables new, holistic optimizations across operator and language boundaries, e.g., operator fusion and workload specialization. As a result, Babelfish avoids data transfers and enables efficient utilization of hardware resources. Our evaluation shows that Babelfish outperforms state-of-the-art data processing systems by up to one order of magnitude and reaches the performance of handwritten code. With Babelfish, we bridge the performance gap between relational and multi-language UDFs and lay the foundation for the efficient execution of future polyglot workloads.
Origami: A High-Performance Mergesort Framework [Download Paper] Arif Arman (Texas A&M University)*, Dmitri Loguinov (Texas A&M University) Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skewness, suitable for parallelization using vectorized intrinsics, and relatively simple to multithread. In this paper, we introduce Origami, an in-memory mergesort framework that is optimized for scalar, as well as all current SIMD (single-instruction multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8× faster than prior methods and a branchless streaming merger that achieves up to a 1.5x speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2x faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
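As a side note, SIMD in-register sorters are typically built from fixed compare-exchange networks expressed with vector min/max instructions, which avoids data-dependent branches. The Python sketch below shows a generic 4-element sorting network in that style; it is only an illustration of the idea, not Origami's actual kernel.

```python
# A 4-element sorting network written only with min/max: each stage maps
# directly to vector min/max instructions, so there are no data-dependent
# branches. Generic illustration, not Origami's in-register sorter.
def sort4(a, b, c, d):
    a, b = min(a, b), max(a, b)   # stage 1: sort the two pairs
    c, d = min(c, d), max(c, d)
    a, c = min(a, c), max(a, c)   # stage 2: merge across the pairs
    b, d = min(b, d), max(b, d)
    b, c = min(b, c), max(b, c)   # stage 3: fix the middle elements
    return a, b, c, d

print(sort4(7, 3, 9, 1))  # (1, 3, 7, 9)
```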
Projection-Compliant Database Generation [Download Paper] Anupam Sanghi (Indian Institute of Science)*, Shadab Ahmed (Indian Institute of Science), Jayant R Haritsa (Indian Institute of Science) Synthesizing data using declarative formalisms has been persuasively advocated in contemporary data generation frameworks. In particular, they specify operator output volumes through row-cardinality constraints. However, thus far, adherence to these volumetric constraints has been limited to the Filter and Join operators. A critical deficiency is the lack of support for the Projection operator, which is at the core of basic SQL constructs such as Distinct, Union and Group By. The technical challenge here is that cardinality unions in multi-dimensional space, and not mere summations, need to be captured in the generation process. Further, dependencies across different data subspaces need to be taken into account. We address the above lacuna by presenting PiGen, a dynamic data generator that incorporates Projection cardinality constraints in its ambit. The design is based on a projection subspace division strategy that supports the expression of constraints using optimized linear programming formulations. Further, techniques of symmetric refinement and workload decomposition are introduced to handle constraints across different projection subspaces. Finally, PiGen supports dynamic generation, where data is generated on-demand during query processing, making it amenable to Big Data environments. A detailed evaluation on workloads derived from real-world and synthetic benchmarks demonstrates that PiGen can accurately and efficiently model Projection outcomes, representing an essential step forward in customized database generation.
Efficient Secure and Verifiable Location-Based Skyline Queries over Encrypted Data [Download Paper] Zuan Wang (HUST), Xiaofeng Ding (Huazhong University of Science and Technology)*, Hai Jin (Huazhong University of Science and Technology), Pan Zhou (Huazhong University of Science and Technology) Supporting secure location-based services on encrypted data that is outsourced to cloud computing platforms remains an ongoing challenge for efficiency due to expensive ciphertext calculation overhead. Furthermore, since the clouds may be untrustworthy or even malicious, data security and result authenticity have caused huge concerns. Unfortunately, little existing work can guarantee query efficiency, dataset confidentiality, and result authenticity at the same time. In this paper, we demonstrate the potential of supporting secure and verifiable location-based skyline queries (SVLSQ). First, we devise a novel and unified structure, named semi-blind R-tree (SR-tree), which protects the query unlinkability. Based on SR-tree, we propose an authenticated data structure, named secure and verifiable scope R-tree (SVSR-tree). Then, we develop several secure protocols based on SVSR-tree to accelerate the query efficiency and reduce the size of verification objects. Our method avoids compromising the privacy of datasets, queries, results and access patterns. Meanwhile, it authenticates the soundness and completeness of the skyline results while preserving privacy. Finally, we analyze the complexity and security of SVLSQ. Findings from the performance evaluation illustrate that SVLSQ is dramatically more efficient than other solutions in terms of both querying (no less than 3 orders of magnitude faster) and verification.
Query Driven-Graph Neural Networks for Community Search: From Non-Attributed, Attributed, to Interactive Attributed [Download Paper] Yuli Jiang (The Chinese University of Hong Kong)*, Yu Rong (Tencent AI Lab), Hong Cheng (Chinese University of Hong Kong), Xin Huang (Hong Kong Baptist University), Kangfei Zhao (The Chinese University of Hong Kong), Junzhou Huang (University of Texas at Arlington) Given one or more query vertices, Community Search (CS) aims to find densely intra-connected and loosely inter-connected structures containing query vertices. Attributed Community Search (ACS), a related problem, is more challenging since it finds communities with both cohesive structures and homogeneous vertex attributes. However, most methods for the CS task rely on inflexible pre-defined structures, and studies for ACS treat each attribute independently. Moreover, the most popular ACS strategies decompose ACS into two separate sub-problems, i.e., the CS task and the subsequent attribute filtering task. However, in real-world graphs, the community structure and the vertex attributes are closely correlated to each other. This correlation is vital for the ACS problem. In this vein, we argue that the separation strategy cannot fully capture the correlation between structure and attributes simultaneously and it would compromise the final performance. In this paper, we propose Graph Neural Network (GNN) models for both CS and ACS problems, i.e., Query Driven-GNN (QD-GNN) and Attributed Query Driven-GNN (AQD-GNN). In QD-GNN, we combine the local query-dependent structure and global graph embedding. In order to extend QD-GNN to handle attributes, we model vertex attributes as a bipartite graph and capture the relation between attributes by constructing GNNs on this bipartite graph. With a Feature Fusion operator, AQD-GNN processes the structure and attribute simultaneously and predicts communities according to each attributed query. Experiments on real-world graphs with ground-truth communities demonstrate that the proposed models outperform existing CS and ACS algorithms in terms of both efficiency and effectiveness. More recently, an interactive setting for CS has been proposed that allows users to adjust the predicted communities. We further verify our approaches under the interactive setting and extend them to the attributed context. Our method achieves 2.37% and 6.29% improvements in F1-score over the state-of-the-art models without and with attributes, respectively.
ByteGNN: Efficient Graph Neural Network Training at Large Scale [Download Paper] Chenguang Zheng (CUHK)*, Hongzhi Chen (ByteDance), Yuxuan Cheng (ByteDance Inc), Zhezheng Song (CUHK), Yifan Wu (Peking University), Changji Li (CUHK), James Cheng (CUHK), Hao Yang (Bytedance.com), Shuai Zhang (Bytedance) Graph neural networks (GNNs) have shown excellent performance in a wide range of applications such as recommendation, risk control, and drug discovery. With the increase in the volume of graph data, distributed GNN systems become essential to support efficient GNN training. However, existing distributed GNN training systems suffer from various performance issues including high network communication cost, low CPU utilization, and poor end-to-end performance. In this paper, we propose ByteGNN, which addresses the limitations in existing distributed GNN systems with three key designs: (1) an abstraction of mini-batch graph sampling to support high parallelism, (2) a two-level scheduling strategy to improve resource utilization and to reduce the end-to-end GNN training time, and (3) a graph partitioning algorithm tailored for GNN workloads. Our experiments show that ByteGNN outperforms the state-of-the-art distributed GNN systems with up to 3.5-23.8 times faster end-to-end execution, 2-6 times higher CPU utilization, and around half of the network communication cost.
Pre-training Summarization Models of Structured Datasets for Cardinality Estimation [Download Paper] Yao Lu (Microsoft Research)*, Srikanth Kandula (Microsoft Research), Arnd Christian König (Microsoft), Surajit Chaudhuri (Microsoft) We consider the problem of pre-training models which convert structured datasets into succinct summaries that can be used to answer cardinality estimation queries. Doing so avoids per-dataset training and, in our experiments, reduces the time to construct summaries by up to 100×. When datasets change, our summaries are incrementally updateable. Our key insights are to use multiple summaries per dataset, use learned summaries for columnsets for which other simpler techniques do not achieve high accuracy, and that analogous to similar pre-trained models for images and text, structured datasets have some common frequency and correlation patterns which our models learn to capture by pre-training on a large and diverse corpus of datasets.
UDO: Universal Database Optimization using Reinforcement Learning [Download Paper] Junxiong Wang (Cornell University)*, Immanuel Trummer (Cornell), Debabrota Basu (Inria) UDO is a versatile tool for offline tuning of database systems for specific workloads. UDO can consider a variety of tuning choices, ranging from picking transaction code variants, over index selections, to database system parameter tuning. UDO uses reinforcement learning to converge to near-optimal configurations, creating and evaluating different configurations via actual query executions (instead of relying on simplifying cost models). To cater to different parameter types, UDO distinguishes heavy parameters (which are expensive to change, e.g. physical design parameters) from light parameters. Specifically for optimizing heavy parameters, UDO uses reinforcement learning algorithms that allow delaying the point at which the reward feedback becomes available. This gives us the freedom to optimize the point in time and the order in which different configurations are created and evaluated (by benchmarking a workload sample). UDO uses a cost-based planner to minimize reconfiguration overheads. For instance, it aims to amortize the creation of expensive data structures by consecutively evaluating configurations using them. We evaluate UDO on Postgres as well as MySQL and on TPC-H as well as TPC-C, optimizing a variety of light and heavy parameters concurrently.
CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex [Download Paper] Immanuel Trummer (Cornell)* CodexDB enables users to customize SQL query processing via natural language instructions. CodexDB is based on OpenAI's GPT-3 Codex model which translates text into code. It is a framework on top of GPT-3 Codex that decomposes complex SQL queries into a series of simple processing steps, described in natural language. Processing steps are enriched with user-provided instructions and descriptions of database properties. Codex translates the resulting text into query processing code. An early prototype of CodexDB is able to generate correct code for up to 81% of queries for the WikiSQL benchmark and for up to 62% of queries for the SPIDER benchmark.
Butterfly Counting on Uncertain Bipartite Networks [Download Paper] Alexander Zhou (Hong Kong University of Science and Technology)*, Yue Wang (Shenzhen Institute of Computing Sciences, Shenzhen University.), Lei Chen (Hong Kong University of Science and Technology) When considering uncertain bipartite networks, the number of instances of the popular graphlet structure the butterfly may be used as an important metric to quickly gauge information about the network. This Uncertain Butterfly Count has practical usages in a variety of areas such as biomedical/biological fields, E-Commerce and road networks. In this paper we formally define the uncertain butterfly structure (in which the existential probability of the butterfly is greater than or equal to some user-defined threshold) as well as the Uncertain Butterfly Counting Problem (to determine the number of unique instances of this structure on any uncertain bipartite network). We then examine exact solutions by proposing a non-trivial baseline as well as an improved solution which reduces the time complexity and employs heuristics to further reduce the runtime in practice. In addition to exact solutions, we propose two approximate sampling-based solutions, which can be used to quickly estimate the Uncertain Butterfly Count, a powerful tool when the exact count is unnecessary. Using a range of networks with different edge existential probability distributions, we validate the efficiency and effectiveness of our solutions.
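To make the problem definition concrete, the brute-force Python sketch below enumerates butterflies (2x2 bicliques) in a small uncertain bipartite graph and counts those whose existential probability, the product of the four edge probabilities under edge independence, meets a threshold. This is only a naive baseline for illustration, not the exact or sampling algorithms proposed in the paper; the data and names are made up.

```python
# Naive illustration of the Uncertain Butterfly Counting Problem: count
# (u1, u2, v1, v2) bicliques whose existential probability (product of the
# four independent edge probabilities) is at least the threshold t.
from itertools import combinations

def uncertain_butterfly_count(edges, t):
    """edges: dict mapping (left_vertex, right_vertex) -> existence probability."""
    left = {u for u, _ in edges}
    count = 0
    for u1, u2 in combinations(sorted(left), 2):
        common = {v for (x, v) in edges if x == u1 and (u2, v) in edges}
        for v1, v2 in combinations(sorted(common), 2):
            p = edges[(u1, v1)] * edges[(u1, v2)] * edges[(u2, v1)] * edges[(u2, v2)]
            if p >= t:
                count += 1
    return count

edges = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.7, ("b", "y"): 0.9}
print(uncertain_butterfly_count(edges, 0.4))  # 1 butterfly, probability 0.4536
```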
Answering Regular Path Queries through Exemplars [Download Paper] Komal Chauhan (IIT Delhi), Kartik Jain (IIT Delhi), Sayan Ranu (IIT Delhi)*, Srikanta Bedathur (IIT Delhi), Amitabha Bagchi (IIT Delhi) Regular simple path query (RPQ) is one of the fundamental operators in graph analytics. In an RPQ, the input is a graph, a source node and a regular expression. The goal is to identify all nodes that are connected to the source through a simple path whose label sequence satisfies the given regular expression. The regular expression acts as a formal specification of the search space that is of interest to the user. Although regular expressions have high expressive power, they act as a barrier to non-technical users. Furthermore, to fully realize the power of regular expressions, the user must be familiar with the domain of the graph dataset. In this study, we address this bottleneck by bridging RPQs with the query-by-example paradigm. More specifically, we ask the user for an exemplar pair that characterizes the paths of interest, and the regular expression is automatically inferred from this exemplar. This novel problem introduces several new challenges. How do we infer the regex? Given that answering RPQs is NP-hard, how do we scale to large graphs? We address these challenges through a unique combination of Biermann and Feldman's algorithm with NFA-guided random walks with restarts. Extensive experiments on multiple real, million-scale datasets establish that RQuBE is at least 3 orders of magnitude faster than baseline strategies with an average accuracy in excess of 90%.
DLCR: Efficient Indexing for Label-Constrained Reachability Queries on Large Dynamic Graphs [Download Paper] Xin Chen (The Chinese University of Hong Kong)*, You Peng (The Chinese University of Hong Kong), Sibo Wang (The Chinese University of Hong Kong), Jeffrey Xu Yu (Chinese University of Hong Kong) Many real-world graphs, e.g., social networks, biological networks, knowledge graphs, naturally come with edge labels, with different labels representing different relationships between nodes. On such edge-labeled graphs, a fundamental query is a label-constrained reachability (LCR) query, where we are given a source vertex, a target vertex, and a label set Ψ, and the goal is to determine if there exists any path from the source to the target such that the label of every edge on the path belongs to Ψ. The existing indexing scheme for LCR queries still focuses on static graphs, despite the fact that many edge-labeled graphs are dynamic in nature. Motivated by limitations of existing solutions, we present a study on how to effectively maintain the indexing scheme on dynamic graphs. Our proposed approach is based on the state-of-the-art 2-hop index for LCR queries. In this paper, we present efficient algorithms for updating the index structure in response to dynamic edge insertions/deletions and demonstrate the correctness of our update algorithms. Following that, we show that adopting a query-friendly but update-unfriendly indexing scheme results in surprisingly superior query/update efficiency and outperforms update-friendly schemes. We analyze and demonstrate that the query-friendly indexing scheme actually achieves the same time complexity as those of update-friendly ones. Finally, we present batched update algorithms where the updates may include multiple edge insertions/deletions. Extensive experiments show the effectiveness of the proposed update algorithms, query-friendly indexing scheme, and batched update algorithms.
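For intuition about the query semantics (not about the 2-hop index or its maintenance, which are the paper's contributions), a label-constrained reachability query can be answered online by a BFS that only follows edges whose labels are in the allowed set, as in this small sketch with made-up data:

```python
# Online baseline for an LCR query: BFS restricted to edges whose label is in
# the allowed label set. Index-based approaches avoid this per-query traversal.
from collections import deque

def lcr_reachable(adj, source, target, labels):
    """adj: dict vertex -> list of (neighbor, edge_label); labels: allowed set."""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v, lab in adj.get(u, []):
            if lab in labels and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

adj = {"s": [("a", "friend")], "a": [("t", "colleague")]}
print(lcr_reachable(adj, "s", "t", {"friend", "colleague"}))  # True
print(lcr_reachable(adj, "s", "t", {"friend"}))               # False
```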
G-Tran: A High Performance Distributed Graph Database with a Decentralized Architecture [Download Paper] Hongzhi Chen (CUHK)*, Changji Li (CUHK), Chenguang Zheng (CUHK), Chenghuan Huang (CUHK), Juncheng Fang (CUHK), James Cheng (CUHK), Jian Zhang (The Chinese University of Hong Kong) Graph transaction processing poses unique challenges such as random data access due to the irregularity of graph structures, low throughput and high abort rate due to the relatively large read/write sets in graph transactions. To address these challenges, we present G-Tran, a remote direct memory access (RDMA)-enabled distributed in-memory graph database with serializable and snapshot isolation support. First, we propose a graph-native data store to achieve good data locality and fast data access for transactional updates and queries. Second, G-Tran adopts a fully decentralized architecture that leverages RDMA to process distributed transactions with the massively parallel processing (MPP) model, which can achieve high performance by utilizing all computing resources. In addition, we propose a new multi-version optimistic concurrency control (MV-OCC) protocol with two optimizations to address the issue of large read/write sets in graph transactions. Extensive experiments show that G-Tran achieves competitive performance compared with other popular graph databases on benchmark workloads.
Distributed D-core Decomposition over Large Directed Graphs [Download Paper] Xuankun Liao (Hong Kong Baptist University)*, Qing Liu (Hong Kong Baptist University), Jiaxin Jiang (National University of Singapore), Xin Huang (Hong Kong Baptist University), Jianliang Xu (Hong Kong Baptist University), Byron Choi (Hong Kong Baptist University) Given a directed graph G and integers k and l, a D-core is the maximal subgraph H ⊆ G such that for every vertex of H, its in-degree and out-degree are no smaller than k and l, respectively. For a directed graph G, the problem of D-core decomposition aims to compute the non-empty D-cores for all possible values of k and l. In the literature, several peeling-based algorithms have been proposed to handle D-core decomposition. However, the peeling-based algorithms that work in a sequential fashion and require global graph information during processing are mainly designed for centralized settings, which cannot handle large-scale graphs efficiently in distributed settings. Motivated by this, we study the distributed D-core decomposition problem in this paper. We start by defining a concept called anchored coreness, based on which we propose a new H-index-based algorithm for distributed D-core decomposition. Furthermore, we devise a novel concept, namely skyline coreness, and show that the D-core decomposition problem is equivalent to the computation of skyline corenesses for all vertices. We design an efficient D-index to compute the skyline corenesses distributedly. We implement the proposed algorithms under both vertex-centric and block-centric distributed graph processing frameworks. Moreover, we theoretically analyze the algorithm and message complexities. Extensive experiments on large real-world graphs with billions of edges demonstrate the efficiency of the proposed algorithms in terms of both the running time and communication overhead.
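To make the D-core definition concrete, the sketch below computes a single (k, l) D-core by repeatedly peeling vertices that violate the degree constraints. This is the simple centralized peeling idea the paper contrasts with; it is not the distributed anchored-coreness or skyline-coreness algorithms.

```python
# Centralized peeling for one (k, l) D-core: keep removing vertices whose
# in-degree < k or out-degree < l until the remaining subgraph is stable.
def d_core(nodes, edges, k, l):
    alive = set(nodes)
    changed = True
    while changed:
        changed = False
        indeg = {v: 0 for v in alive}
        outdeg = {v: 0 for v in alive}
        for u, v in edges:
            if u in alive and v in alive:
                outdeg[u] += 1
                indeg[v] += 1
        for v in list(alive):
            if indeg[v] < k or outdeg[v] < l:
                alive.remove(v)   # peel a violating vertex and iterate again
                changed = True
    return alive

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
print(d_core(nodes, edges, 1, 1))  # {'a', 'b', 'c'}
```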
MIDE: Accuracy Aware Minimally Invasive Data Exploration For Decision Support [Download Paper] Sameera Ghayyur (UC Irvine)*, Dhrubajyoti Ghosh (UC Irvine), Xi He (University of Waterloo), Sharad Mehrotra (U.C. Irvine) This paper studies privacy in the context of decision-support queries that classify objects as either true or false based on whether they satisfy the query. Mechanisms to ensure privacy may result in false positives and false negatives. In decision-support applications, often, false negatives have to remain bounded. Existing accuracy-aware privacy-preserving techniques, e.g., Apex, cannot directly be used to support such an accuracy requirement, and their naive adaptations to support bounded accuracy of false negatives result in significant privacy loss depending upon the distribution of data. This paper explores the concept of minimally-invasive data exploration for decision support that attempts to minimize privacy loss while supporting a bounded guarantee on false negatives by adaptively adjusting privacy based on the data distribution. Our experimental results show that the MIDE algorithms perform well and are robust over variations in data distributions.
HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data [Download Paper] Fumiyuki Kato (Kyoto University)*, Tsubasa Takahashi (LINE Corporation), Shun Takagi (Kyoto University), Yang Cao (Kyoto University), Seng Pei Liew (LINE Corporation), Masatoshi Yoshikawa (Kyoto University) How can we explore the unknown properties of high-dimensional sensitive relational data while preserving privacy? We study how to construct an explorable privacy-preserving materialized view under differential privacy. No existing state-of-the-art methods simultaneously satisfy the following essential properties in data exploration: workload independence, analytical reliability (i.e., providing error bound for each search query), applicability to high-dimensional data, and space efficiency. To solve the above issues, we propose HDPView, which creates a differentially private materialized view by well-designed recursive bisected partitioning on an original data cube, i.e., count tensor. Our method searches for block partitioning to minimize the error for the counting query, in addition to randomizing the convergence, by choosing the effective cutting points in a differentially private way, resulting in a less noisy and compact view. Furthermore, we ensure formal privacy guarantee and analytical reliability by providing the error bound for arbitrary counting queries on the materialized views. HDPView has the following desirable properties: (a) Workload independence, (b) Analytical reliability, (c) Noise resistance on high-dimensional data, (d) Space efficiency. To demonstrate the above properties and the suitability for data exploration, we conduct extensive experiments with eight types of range counting queries on eight real datasets. HDPView outperforms the state-of-the-art methods in these evaluations.
Qanaat: A Scalable Multi-Enterprise Permissioned Blockchain System with Confidentiality Guarantees [Download Paper] Mohammad Javad Amiri (University of Pennsylvania)*, Boon Thau Loo (Univ. of Pennsylvania), Divy Agrawal (University of California, Santa Barbara), Amr El Abbadi (UC Santa Barbara) Today's large-scale data management systems need to address distributed applications' confidentiality and scalability requirements among a set of collaborative enterprises. In this paper, we present Qanaat, a scalable multi-enterprise permissioned blockchain system that guarantees the confidentiality of enterprises within and across collaboration workflows. Qanaat presents data collections that enable any subset of enterprises involved in a collaboration workflow to keep their collaboration private from other enterprises. A transaction ordering scheme is also presented to enforce only the necessary and sufficient constraints on transaction ordering to guarantee data consistency. Furthermore, Qanaat supports data consistency across collaboration workflows where an enterprise can participate in different collaboration workflows with different sets of enterprises. Finally, Qanaat presents a suite of consensus protocols to support different types of intra-shard and cross-shard transactions within or across enterprises.
Decentralized Crowdsourcing for Human Intelligence Tasks with Efficient On-Chain Cost [Download Paper] Yihuai Liang (Inha University), Yan Li (Inha University), Byeong-seok Shin (Inha University)* Crowdsourcing for Human Intelligence Tasks (HIT) has been widely used to crowdsource human knowledge, such as image annotation for machine learning. We use a public blockchain to play the role of traditional centralized HIT systems, such that the blockchain deals with cryptocurrency payments and acts as a trustworthy judge to resolve disputes between a worker and a requester in a decentralized setting, preventing false-reporting and free-riding. Our approach neither uses expensive cryptographic tools, such as zero-knowledge proofs, nor sends the worker's answers to the blockchain. Compared with prior works, our approach significantly reduces on-chain cost: it only requires O(1) on-chain storage and O(logN) smart contract computation, where N is the question number of a HIT. Additionally, our approach uses known answers or gold standards to determine the worker's answer quality. To motivate the requester to use honest known answers, the requester cannot learn the worker's answers if the answer quality does not meet the requirement. We further provide formal security definitions for our decentralized HIT and prove security of our construction.
Don't Be a Tattle-Tale: Preventing Leakages through Data Dependencies on Access Control Protected Data [Download Paper] Primal Pappachan (UCI)*, Shufan Zhang (University of Waterloo), Xi He (University of Waterloo), Sharad Mehrotra (U.C. Irvine) We study the problem of answering queries when (part of) the data may be sensitive and should not be leaked to the querier. Simply restricting the computation to the non-sensitive part of the data may leak sensitive data through inference based on data dependencies. While inference control from data dependencies during query processing has been studied in the literature, existing solutions either detect and deny queries causing leakage or use a weak security model that only protects against exact reconstruction of the sensitive data. In this paper, we adopt a stronger security model based on full deniability that prevents any information about sensitive data from being inferred from query answers. We identify conditions under which full deniability can be achieved and develop an efficient algorithm that minimally hides non-sensitive cells during query processing to achieve full deniability. We experimentally show that our approach is practical and scales to an increasing proportion of sensitive data as well as to increasing database sizes.
ANN Softmax: Acceleration of Extreme Classification Training [Download Paper] Kang Zhao (Alibaba)*, Liuyihan Song (Alibaba Group), Yingya Zhang (Alibaba Group), Pan Pan (Alibaba Group), Yinghui Xu (Alibaba Group), Rong Jin (alibaba group) Thanks to the popularity of GPU and the growth of its computational power, more and more deep learning tasks, such as face recognition, image retrieval and word embedding, can take advantage of extreme classification to improve accuracy. However, it remains a big challenge to train a deep model with millions of classes efficiently due to the huge memory and computation consumption in the last layer. By sampling a small set of classes to avoid the total classes calculation, sampling-based approaches have proved to be an effective solution. But most of them suffer from the following two issues: i) the important classes are ignored or only partly sampled, such as the methods using random sampling schemes or retrieval techniques of low recall (e.g., locality-sensitive hashing), resulting in the degradation of accuracy; ii) inefficient implementation owing to incompatibility with GPU, like selective softmax. It uses hashing forest to help select classes, but the search process is implemented in CPU. To address the above issues, we propose a new sampling-based softmax called ANN Softmax in this paper. Specifically, we employ binary quantization with an inverted file system to improve the recall of important classes. With the help of dedicated kernel design, it can be totally parallelized in mainstream training frameworks. Then, we find the size of important classes that are recalled by each training sample has a great impact on the final accuracy, so we introduce sample grouping optimization to well approximate the full classes training. Experimental evaluations on two tasks (Embedding Learning and Classification) and ten datasets (e.g., MegaFace, ImageNet, SKU datasets) demonstrate our proposed method maintains the same precision as Full Softmax for different loss objectives, including cross entropy loss, ArcFace, CosFace and D-Softmax loss, with only 1/10 sampled classes, which outperforms the state-of-the-art techniques. Moreover, we implement ANN Softmax in a complete GPU pipeline that can accelerate the training by more than 4.3×. Equipped with a 256-GPU cluster, our method reduces the time of training a classifier of 300 million classes on our SKU-300M dataset to ten days.
ETO: Accelerating Optimization of DNN Operators by High-Performance Tensor Program Reuse [Download Paper] Jingzhi Fang (HKUST), Yanyan Shen (Shanghai Jiao Tong University), Yue Wang (Shenzhen Institute of Computing Sciences, Shenzhen University.), Lei Chen (Hong Kong University of Science and Technology)* Recently, deep neural networks (DNNs) have achieved great success in various applications, where low inference latency is important. Existing solutions either manually tune the kernel library or utilize search-based compilation to reduce the operator latency. However, manual tuning requires significant engineering effort, and the huge search space makes the search cost of the search-based compilation unaffordable in some situations. In this work, we propose ETO, a framework for speeding up DNN operator optimization based on reusing the information of performant tensor programs. Specifically, ETO defines conditions for the information reuse between two operators. For operators satisfying the conditions, based on the performant tensor program information of one operator, ETO uses a reuse-based tuner to significantly prune the search space of the other one, and keeps optimization effectiveness at the same time. In this way, for a set of operators, ETO first determines the information reuse relationships among them to reduce the total search time needed, and then tunes the operators either by the backend compiler or by the reuse-based tuner accordingly. ETO further increases the reuse opportunities among the operators by injecting extra operators as bridges between two operators which do not satisfy the reuse conditions. Compared with various existing methods, the experiments show that ETO is effective and efficient in optimizing DNN operators.
A Learned Query Rewrite System using Monte Carlo Tree Search [Download Paper] Xuanhe Zhou (Tsinghua), Guoliang Li (Tsinghua University)*, Chengliang Chai (Tsinghua University), Jianhua Feng (Tsinghua) Query rewrite transforms a SQL query into an equivalent one but with higher performance. However, SQL rewrite is an NP-hard problem, and existing approaches adopt heuristics to rewrite the queries. These heuristics have two main limitations. First, the order of applying different rewrite rules significantly affects the query performance. However, the search space of all possible rewrite orders grows exponentially with the number of query operators and rules and it is rather hard to find the optimal rewrite order. Existing methods apply a pre-defined order to rewrite queries and will fall into a local optimum. Second, different rewrite rules have different benefits for different queries. Existing methods cannot effectively estimate the benefits. To address these challenges, we propose a policy tree based query rewrite framework, where the root is the input query and each node is a rewritten query from its parent. We aim to explore the nodes of the policy tree to find the optimal rewritten query. We propose to use Monte Carlo Tree Search to explore the policy tree, which navigates the policy tree to efficiently get the optimal node. Moreover, we propose a learning-based model to estimate the expected performance improvement of each rewritten query, which guides the tree search more accurately. We also propose a parallel algorithm that can explore the tree search in parallel in order to improve the performance. Experimental results show that our method significantly outperforms existing approaches.
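For readers unfamiliar with Monte Carlo Tree Search, the standard UCT selection rule below balances exploiting nodes with high observed benefit against exploring rarely visited ones; the paper pairs tree search with a learned benefit estimator, and its exact selection variant is not reproduced here.

```latex
% Standard UCT rule: from the current policy-tree node, descend into the child i
% maximizing average observed reward plus an exploration bonus, where \bar{r}_i is
% the mean reward of child i, n_i its visit count, N the parent's visit count,
% and c an exploration constant.
\mathrm{UCT}(i) \;=\; \bar{r}_i + c\,\sqrt{\frac{\ln N}{n_i}}
```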
Analyzing How BERT Performs Entity Matching [Download Paper] Matteo Paganelli (Università di Modena e Reggio Emilia)*, Francesco Del Buono (University of Modena e Reggio Emilia), Andrea Baraldi (Università di Modena e Reggio Emilia), Francesco Guerra (University of Modena e Reggio Emilia) State-of-the-art Entity Matching (EM) approaches rely on transformer architectures, such as BERT, for generating highly contextualized embeddings of terms. The embeddings are then used to predict whether pairs of entity descriptions refer to the same real-world entity. BERT-based EM models have been demonstrated to be effective, but act as black boxes for the users, who have limited insight into the motivations behind their decisions. In this paper, we perform a multi-facet analysis of the components of pre-trained and fine-tuned BERT architectures applied to an EM task. The main findings resulting from our extensive experimental evaluation are (1) the fine-tuning process applied to the EM task mainly modifies the last layers of the BERT components, but in a different way on tokens belonging to descriptions of matching / non-matching entities; (2) the special structure of the EM datasets, where records are pairs of entity descriptions, is recognized by BERT; (3) the pair-wise semantic similarity of tokens is not a key knowledge exploited by BERT-based EM models.
Towards Communication-efficient Vertical Federated Learning Training via Cache-enabled Local Update [Download Paper] [Scalable Data Science] Fangcheng Fu (Peking University)*, Xupeng Miao (Peking University), Jiawei Jiang (Wuhan University), Huanran Xue (Tencent Inc.), Bin Cui (Peking University) Vertical federated learning (VFL) is an emerging paradigm that allows different parties (e.g., organizations or enterprises) to collaboratively build machine learning models with privacy protection. In the training phase, VFL only exchanges the intermediate statistics, i.e., forward activations and backward derivatives, across parties to compute model gradients. Nevertheless, due to its geo-distributed nature, VFL training usually suffers from the low WAN bandwidth. In this paper, we introduce CELU-VFL, a novel and efficient VFL training framework that exploits the local update technique to reduce the cross-party communication rounds. CELU-VFL caches the stale statistics and reuses them to estimate model gradients without exchanging the ad hoc statistics. Significant techniques are proposed to improve the convergence performance. First, to handle the stochastic variance problem, we propose a uniform sampling strategy to fairly choose the stale statistics for local updates. Second, to harness the errors brought by the staleness, we devise an instance weighting mechanism that measures the reliability of the estimated gradients. Theoretical analysis proves that CELU-VFL achieves a similar sub-linear convergence rate as vanilla VFL training but requires much fewer communication rounds. Empirical results on both public and real-world workloads validate that CELU-VFL can be up to six times faster than the existing works.
Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations [Download Paper] Zichun Huang (Institute of Computing Technology, Chinese Academy of Sciences), Shimin Chen (Chinese Academy of Sciences)* A Join-Project operation is a join operation followed by a duplicate eliminating projection operation. It is used in a large variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that exploits the classical solution (i.e., join and deduplication), and MM (matrix multiplication) to process the sparse and the dense portions of the input data, respectively. However, we observe three problems in the state-of-the-art solution: 1) The outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) Its table-to-matrix transformation makes an over-simplified assumption of the attribute values; and 3) There is a mismatch between the employed MM in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM3, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method to completely remove the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose DenseEC and SparseBMM algorithms to exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM3 to consider partial result caching and support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results using both real-world and synthetic data sets show that DIM3 outperforms previous Join-Project solutions by a factor of 2.3x-18x. Compared to RDBMSs, DIM3 achieves orders of magnitude speedups.
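As a toy illustration of why matrix multiplication helps for the dense portion of a Join-Project (it never materializes duplicate join results), the sketch below computes the distinct (a, c) pairs of R(a, b) JOIN S(b, c) via an integer matrix product. This is only the textbook MM formulation, not DIM3's intersection-free partitioning or its DenseEC/SparseBMM algorithms; the relation contents are made up.

```python
# Join-Project via matrix multiplication: encode R and S as 0/1 matrices over
# their attribute domains; a nonzero entry of R @ S marks a distinct (a, c) pair.
import numpy as np

def join_project(r_pairs, s_pairs):
    a_vals = sorted({a for a, _ in r_pairs})
    b_vals = sorted({b for _, b in r_pairs} | {b for b, _ in s_pairs})
    c_vals = sorted({c for _, c in s_pairs})
    a_idx = {v: i for i, v in enumerate(a_vals)}
    b_idx = {v: i for i, v in enumerate(b_vals)}
    c_idx = {v: i for i, v in enumerate(c_vals)}
    R = np.zeros((len(a_vals), len(b_vals)), dtype=np.int64)
    S = np.zeros((len(b_vals), len(c_vals)), dtype=np.int64)
    for a, b in r_pairs:
        R[a_idx[a], b_idx[b]] = 1
    for b, c in s_pairs:
        S[b_idx[b], c_idx[c]] = 1
    P = R @ S  # P[i, j] > 0 iff a_vals[i] joins with c_vals[j] through some b
    return {(a_vals[i], c_vals[j]) for i, j in zip(*np.nonzero(P))}

print(join_project([(1, "x"), (1, "y"), (2, "x")], [("x", 10), ("y", 10)]))
# {(1, 10), (2, 10)}
```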
Making RDBMSs Efficient on Graph Workloads Through Predefined Joins [Download Paper] Guodong Jin (Renmin University of China)*, Semih Salihoglu (University of Waterloo) Joins in native graph database management systems (GDBMSs) are predefined to the system as edges, which are indexed in adjacency list indices and serve as pointers. This contrasts with and can be more performant than value-based joins in RDBMSs. Existing approaches to integrate predefined joins into RDBMSs adopt a strict separation of graph and relational data and processors, where a graph-specific processor uses left-deep and index nested loop joins for a subset of joins. This may be suboptimal, and may lead to non-sequential scans of data in some queries. We propose an alternative purely relational approach that uses row IDs (RIDs) of tuples as pointers. Users can predefine equality joins between any two tables, which leads to materializing RIDs in extended tables and optionally in RID indices. Instead of using the RID index to perform the join directly, we use it primarily in hash joins to generate filters that can be passed to scans using sideways information passing, ensuring sequential scans. Our approach does not introduce any graph-specific system components, can execute predefined joins on any join plan, and can improve performance on any workload that contains equality joins that can be predefined. We integrated our approach to DuckDB and call the resulting system GRainDB. We demonstrate that GRainDB far improves the performance of DuckDB on relational and graph workloads with large many-to-many joins, making it competitive with a state-of-the-art GDBMS, and incurs no major overheads otherwise.
Columnar Formats for Schemaless LSM-based Document Stores [Download Paper] Wail Y Alkowaileet (UC Irvine)*, Michael Carey (UC Irvine) In the last decade, document store database systems have gained more traction for storing and querying large volumes of semi-structured data. However, the flexibility of the document stores' data models has limited their ability to store data in a columnar-major layout --- making them less performant for analytical workloads than column store relational databases. In this paper, we propose several techniques based on piggy-backing on Log-Structured Merge (LSM) tree events and tailored to document stores to store data in a columnar layout. We first extend the Dremel format, a popular on-disk columnar format for semi-structured data, to comply with document stores' flexible data model. We then introduce a new columnar layout for organizing and storing data in LSM-based storage. We also highlight the potential of using query compilation techniques for document stores, where values' types are known only at runtime. We have implemented and evaluated our techniques to measure their impact on storage, data ingestion, and query performance in Apache AsterixDB. Our experiments show significant performance gains, improving the query execution time by orders of magnitude while minimally impacting ingestion performance.
Spooky: Granulating LSM-Tree Compactions Correctly [Download Paper] Niv Dayan (University of Toronto)*, Tamar Weiss (Pliops), Shmuel Dashevsky (Pliops), Michael Pan (Pliops), Edward Bortnikov (Pliops), Moshe Twitto (Pliops) Modern storage engines and key-value stores have come to rely on the log-structured merge-tree (LSM-tree) as their core data structure. LSM-tree operates by gradually sort-merging data across levels of exponentially increasing capacities in storage. A crucial design dimension of LSM-tree is its compaction granularity. Some designs perform Full Merge, whereby entire levels get compacted at once. Others perform Partial Merge, whereby smaller groups of files with overlapping key ranges are compacted independently. This paper shows that both strategies exhibit serious flaws. With Full Merge, space-amplification is exorbitant. The reason is that while compacting the LSM-tree's largest level, there must be at least twice as much storage space as data to store both the original and new files until the compaction is finished. On the other hand, Partial Merge exhibits excessive write-amplification. The reason is twofold. (1) The files getting compacted typically do not have perfectly overlapping key ranges, and so some non-overlapping data is superfluously rewritten in each compaction. (2) Files with different lifetimes become interspersed within the SSD thus necessitating high overheads for SSD garbage-collection. As the data size grows, these problems grow in magnitude. We introduce Spooky, a novel compaction granulation method to address these problems. Spooky partitions data at the largest level into equally sized files, and it partitions data at smaller levels based on the file boundaries at the largest level. This allows merging one group of perfectly overlapping files at a time to limit space-amplification and compaction overheads. At the same time, Spooky writes and deletes larger though fewer files simultaneously. As a result, files with different lifetimes intersperse to a lesser extent within the SSD and thereby cheapen garbage-collection. We show empirically that Spooky achieves >2x lower space-amplification than Full Merge and >2x lower write-amplification than Partial Merge at the same time.
MP-RW-LSH: An Efficient Multi-Probe LSH Solution to ANNS-L_1 [Download Paper] Huayi Wang (Georgia Institute of Technology)*, Jingfan Meng (Georgia Institute of Technology), Long Gong (Facebook), Jun Xu (Georgia Tech), Mitsunori Ogihara (University of Miami) Approximate Nearest Neighbor Search (ANNS) is a fundamental algorithmic problem, with numerous applications in many areas of computer science. Locality-Sensitive Hashing (LSH) is one of the most popular solution approaches for ANNS. A common shortcoming of many LSH schemes is that since they probe only a single bucket in a hash table, they need to use a large number of hash tables to achieve a high query accuracy. For ANNS-L_2, a multi-probe scheme was proposed to overcome this drawback by strategically probing multiple buckets in a hash table. In this work, we propose MP-RW-LSH, the first and so far only multi-probe LSH solution to ANNS in L_1 distance, and show that it achieves a better tradeoff between scalability and query efficiency than all existing LSH-based solutions. We also explain why a state-of-the-art ANNS-L_1 solution called Cauchy projection LSH (CP-LSH) is fundamentally not suitable for multi-probe extension. Finally, as a use case, we construct, using MP-RW-LSH as the underlying ''ANNS-L_1 engine'', a new ANNS-E (E for edit distance) solution that beats the state of the art.
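For context, the classic p-stable LSH for L1 that the abstract refers to as Cauchy projection LSH hashes a point by projecting it onto a Cauchy-distributed vector and bucketing the result; the sketch below shows that baseline scheme only, not MP-RW-LSH's random-walk construction or its multi-probe strategy.

```python
# Classic 1-stable (Cauchy projection) LSH for L1: h(x) = floor((a.x + b) / w)
# with a drawn from a Cauchy distribution. Shown as background; MP-RW-LSH
# replaces this with a random-walk-based scheme that supports multi-probing.
import numpy as np

rng = np.random.default_rng(0)

def make_cauchy_hash(dim, w):
    a = rng.standard_cauchy(dim)  # 1-stable projection direction
    b = rng.uniform(0, w)         # random offset within a bucket width
    return lambda x: int(np.floor((np.dot(a, x) + b) / w))

h = make_cauchy_hash(dim=4, w=4.0)
print(h(np.array([1.0, 2.0, 3.0, 4.0])))
print(h(np.array([1.0, 2.0, 3.0, 4.1])))  # nearby points often share a bucket
```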
Hercules Against Data Series Similarity Search [Download Paper] Karima Echihabi (Mohammed VI Polytechnic University)*, Panagiota Fatourou ( University of Crete), Kostas Zoumpatianos (Snowflake Computing), Themis Palpanas (Université Paris Cité), Houda Benbrahim (ENSIAS, Université Mohammed V de Rabat) In this paper, we propose Hercules, a parallel tree-based technique for exact similarity search on massive disk-based data series collections. We present novel index construction and query answering algorithms that leverage different summarization techniques, carefully schedule costly operations, optimize memory and disk accesses, and exploit the multi-threading and SIMD capabilities of modern hardware to perform CPU-intensive calculations. We demonstrate the superiority and robustness of Hercules with an extensive experimental evaluation against the state-of-the-art techniques, using a variety of synthetic and real datasets, and query workloads of varying difficulty. The results show that Hercules performs up to one order of magnitude faster than the best competitor (which is not always the same). Moreover, Hercules is the only index that outperforms the optimized scan on all scenarios, including the hard query workloads on disk-based datasets.
ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain [Download Paper] Shangdi Yu (Massachusetts Institute of Technology)*, Yiqiu Wang (Massachusetts Institute of Technology), Yan Gu (UC Riverside), Laxman Dhulipala (MIT CSAIL), Julian Shun (MIT) This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8-110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75-54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
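For background, ParChain parallelizes the classical nearest-neighbor chain algorithm; the sequential version below (with a quadratic distance matrix and complete linkage) illustrates the reciprocal-nearest-neighbor merging that the framework builds on, while ParChain itself merges many chains per round and avoids quadratic memory. The example data is made up.

```python
# Sequential nearest-neighbor chain HAC with complete linkage: follow nearest
# neighbors until two clusters are mutual nearest neighbors, then merge them.
import numpy as np

def nn_chain_hac(dist):
    """dist: symmetric n x n distance matrix; returns the merge sequence."""
    D = dist.astype(float).copy()
    active = set(range(D.shape[0]))
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        a = chain[-1]
        b = min((x for x in active if x != a), key=lambda x: D[a, x])
        if len(chain) >= 2 and b == chain[-2]:
            chain.pop(); chain.pop()          # reciprocal nearest neighbors: merge
            merges.append((a, b, D[a, b]))
            for x in active - {a, b}:
                D[a, x] = D[x, a] = max(D[a, x], D[b, x])  # complete-linkage update
            active.remove(b)
        else:
            chain.append(b)
    return merges

pts = np.array([[0.0], [0.1], [5.0], [5.2]])
print(nn_chain_hac(np.abs(pts - pts.T)))  # the two tight pairs merge first
```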
xFraud: Explainable Fraud Transaction Detection [Download Paper] Susie Xi Rao (ETH)*, Shuai Zhang (ETH Zurich), Zhichao Han (Ebay), Zitao Zhang (eBay), Wei Min (ebay), Zhiyao Chen (eBay), Yinan Shan (Ebay), Yang Zhao (Ebay), Ce Zhang (ETH) At online retail platforms, it is crucial to actively detect the risks of transactions to improve customer experience and minimize financial loss. In this work, we propose xFraud, an explainable fraud transaction prediction framework which is mainly composed of a detector and an explainer. The xFraud detector can effectively and efficiently predict the legitimacy of incoming transactions. Specifically, it utilizes a heterogeneous graph neural network to learn expressive representations from the informative heterogeneously typed entities in the transaction logs. The explainer in xFraud can generate meaningful and human-understandable explanations from graphs to facilitate further processes in the business unit. In our experiments with xFraud on real transaction networks with up to 1.1 billion nodes and 3.7 billion edges, xFraud is able to outperform various baseline models in many evaluation metrics while remaining scalable in distributed settings. In addition, we show that xFraud explainer can generate reasonable explanations to significantly assist the business analysis via both quantitative and qualitative evaluations.
AutoCTS: Automated Correlated Time Series Forecasting [Download Paper] Xinle Wu (Aalborg University)*, Dalin Zhang (Aalborg University), Chenjuan Guo (Aalborg University), Chaoyang He (University of Southern California), Bin Yang (Aalborg University), Christian S Jensen (Aalborg University) Correlated time series (CTS) forecasting plays an essential role in many cyber-physical systems, where multiple sensors emit time series that capture interconnected processes. Solutions based on deep learning that deliver state-of-the-art CTS forecasting performance employ a variety of spatio-temporal (ST) blocks that are able to model temporal dependencies and spatial correlations among time series. However, two challenges remain. First, ST-blocks are designed manually, which is time consuming and costly. Second, existing forecasting models simply stack the same ST-blocks multiple times, which limits the model potential. To address these challenges, we propose AutoCTS that is able to automatically identify highly competitive ST-blocks as well as forecasting models with heterogeneous ST-blocks connected using diverse topologies, as opposed to the same ST-blocks connected using simple stacking. Specifically, we design both a micro and a macro search space to model possible architectures of ST-blocks and the connections among heterogeneous ST-blocks, and we provide a search strategy that is able to jointly explore the search spaces to identify optimal forecasting models. Extensive experiments on eight commonly used CTS forecasting benchmark datasets justify our design choices and demonstrate that AutoCTS is capable of automatically discovering forecasting models that outperform state-of-the-art human-designed models.
Popularity Prediction for Social Media over Arbitrary Time Horizons [Download Paper] Daniel Haimovich (Facebook), Dmytro Karamshuk (Facebook)*, Thomas J. Leeper (Facebook), Evgeniy Riabenko (Facebook), Milan Vojnovic (London School of Economics) Predicting the popularity of social media content in real time requires approaches that efficiently operate at global scale. Popularity prediction is important for many applications, including detection of harmful viral content to enable timely content moderation. The prediction task is difficult because views result from interactions between user interests, content features, resharing, feed ranking, and network structure. We consider the problem of accurately predicting popularity both at any given prediction time since a content item's creation and for arbitrary time horizons into the future. In order to achieve high accuracy for different prediction time horizons, it is essential for models to use static features (of content and user) as well as observed popularity growth up to prediction time. We propose a feature-based approach based on a self-excited Hawkes point process model, which involves prediction of the content's popularity at one or more reference horizons in tandem with a point predictor of an effective growth parameter that reflects the timescale of popularity growth. This results in a highly scalable method for popularity prediction over arbitrary prediction time horizons that also achieves a high degree of accuracy, compared to several leading baselines, on a dataset of public page content on Facebook over a two-month period, covering billions of content views and hundreds of thousands of distinct content items. The model has shown competitive prediction accuracy against a strong baseline that consists of separately trained models for specific prediction time horizons.
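For reference, a self-excited Hawkes process models the event rate as a base intensity plus contributions from past events; the standard conditional intensity is shown below (the paper's feature-based parameterization and its effective growth parameter are not reproduced here).

```latex
% Conditional intensity of a self-exciting Hawkes process: a background rate
% \mu(t) plus a kernel \phi summed over past events t_i (e.g., past views).
\lambda(t) \;=\; \mu(t) + \sum_{t_i < t} \phi(t - t_i)
```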
Blueprint: a constraint-solving approach for document extraction [Download Paper] Andrey Mishchenko (University Of Michigan), Dominique Danco (University of Amsterdam), Abhilash Jindal (IIT Delhi)*, Adrian Blue (Instabase) Blueprint is a declarative domain-specific language for document extraction. Users describe document layout using spatial, textual, semantic, and numerical fuzzy constraints, and the language runtime extracts the field-value mappings that best satisfy the constraints in a given document. We used Blueprint to develop several document extraction solutions in a commercial setting. This approach to the extraction problem proved powerful. Concise Blueprint programs were able to generate good accuracy on a broad set of use cases. However, a major goal of our work was to build a system that non-experts, and in particular non-engineers, could use effectively, and we found that writing declarative fuzzy constraint-based extraction programs was not intuitive for many users: a large up-front learning investment was required to be effective, and debugging was often challenging. To address these issues, we developed a no-code IDE for Blueprint, called Studio, as well as program synthesis functionality for automatically generating Blueprint programs from training data, which could be created by labeling document samples in our IDE. Overall, the IDE significantly improved the Blueprint development experience and the results users were able to achieve. In this paper, we discuss the design, implementation, and deployment of Blueprint and Studio. We compare our system with a state-of-the-art deep-learning based extraction tool and show that our system can achieve comparable accuracy results, with comparable development time, for appropriately-chosen use cases, while providing better interpretability and debuggability.
08Sep
COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression [Download Paper] Sian Jin (Washington State University), Chengming Zhang (Washington State University), Xintong Jiang (McGill University), Yunhe Feng (University of Washington), Hui Guan (University of Massachusetts, Amherst), Guanpeng Li (University of Iowa), Shuaiwen Song (University of Sydney), Dingwen Tao (Washington State University)* Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Training wide and deep neural networks requires large amounts of storage resources such as memory, because the intermediate activation data must be saved in memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which significantly limits the maximum batch size and hence performance speedup when training large-scale DNNs. Traditional memory saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training, in order to allow training larger models or to accelerate training. Different from the state-of-the-art solutions that adopt image-based lossy compressors (such as JPEG) to compress the activation data, our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis of the compression error propagation from the altered activation data to the gradients, and empirically investigate the impact of altered gradients over the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. We evaluate our design against state-of-the-art solutions with five widely adopted CNNs and the ImageNet dataset. Experiments demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework, with little or no accuracy loss.
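As a rough illustration of error-bounded lossy compression of activation data, the sketch below applies uniform scalar quantization with a strict absolute error bound; in a real compressor such as the one adapted in COMET the quantized codes would additionally be entropy/lossless coded, and all names here are illustrative only.

    import numpy as np

    def compress_activations(act, error_bound):
        # Uniform scalar quantization: each value is rounded to the nearest
        # multiple of 2 * error_bound, so reconstruction error is <= error_bound.
        return np.round(act / (2.0 * error_bound)).astype(np.int32)

    def decompress_activations(codes, error_bound):
        return codes.astype(np.float32) * (2.0 * error_bound)

    # Toy usage: verify the strict error bound on random activation data.
    act = np.random.randn(4, 64, 32, 32).astype(np.float32)
    eb = 1e-2
    rec = decompress_activations(compress_activations(act, eb), eb)
    assert float(np.max(np.abs(rec - act))) <= eb + 1e-6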
Learned Cardinality Estimation: A Design Space Exploration and a Comparative Evaluation [EA&B] [Download Paper] Ji Sun (Tsinghua University), Jintao Zhang (Tsinghua University), Zhaoyan Sun (Tsinghua University), Guoliang Li (Tsinghua University)*, Nan Tang (Qatar Computing Research Institute, HBKU) Cardinality estimation -- which predicts the result size of an SQL query -- is core to the query optimizers of database management systems. Non-learned methods, especially those based on histograms and sampling, have been the predominant methods for decades and are widely used in commercial and open-source DBMSs. Nevertheless, histograms and samples can only summarize one or a few columns and fall short of capturing the joint data distribution over an arbitrary combination of columns, because they oversimplify the original relational table(s). Consequently, these traditional methods typically make bad predictions for hard cases such as queries over multiple columns, with multiple predicates, and joins between multiple tables. Recently, learned cardinality estimators have been widely studied. Because these learned estimators can better capture the data distribution and query characteristics, empowered by the recent advances in (deep learning) models, they outperform non-learned methods in many cases. The goals of this paper are to provide a design space exploration of learned cardinality estimators and a comprehensive comparison of the state-of-the-art learned approaches, so as to provide guidance for practitioners to decide which method to use under various practical scenarios.
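A minimal sketch of the learned-estimator idea discussed above: featurize a range query, fit a regressor on log-cardinalities, and predict. This toy linear model on hand-picked features merely stands in for the deep models surveyed in the paper; all names and data are hypothetical.

    import numpy as np

    def featurize(query):
        # Encode a conjunctive range query over two columns as [lo1, hi1, lo2, hi2],
        # all normalized to [0, 1]; purely illustrative.
        return np.asarray(query, dtype=float)

    def train_estimator(queries, true_cards):
        # Fit a linear model on log-cardinalities with ordinary least squares.
        X = np.vstack([featurize(q) for q in queries])
        X = np.hstack([X, np.ones((len(X), 1))])       # bias column
        y = np.log1p(np.asarray(true_cards, dtype=float))
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def estimate(w, query):
        x = np.append(featurize(query), 1.0)
        return float(np.expm1(x @ w))

    # Toy usage: labels would come from executing the training queries.
    queries = [(0.0, 0.5, 0.0, 0.5), (0.1, 0.9, 0.2, 0.8), (0.0, 1.0, 0.0, 0.2), (0.2, 0.7, 0.1, 0.9)]
    cards = [250, 560, 200, 410]
    w = train_estimator(queries, cards)
    print(estimate(w, (0.0, 0.6, 0.0, 0.6)))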
Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers [Download Paper] Youjie Li (UIUC)*, Amar Phanishayee (Microsoft Research), Derek Murray (Lacework), Jakub Tarnawski (Microsoft Research), Nam Sung Kim (University of Illinois at Urbana-Champaign) Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training massive models efficiently on a single commodity server. Across various massive DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.
Tiresias: Enabling Predictive Autonomous Storage and Indexing [Download Paper] Michael Abebe (University of Waterloo)*, Horatiu Lazu (University of Waterloo), Khuzaima Daudjee (University of Waterloo) To efficiently store and query a DBMS, administrators must select storage and indexing configurations. For example, one must decide whether data should be stored in rows or columns, in memory or on disk, and which columns to index. These choices can be challenging to make for mixed workloads that require hybrid transactional and analytical processing (HTAP) support. There is growing interest in system designs that can adapt how data is stored and indexed to execute these workloads efficiently. We present Tiresias, a predictor that learns the cost of data accesses and predicts their latency and likelihood under different storage scenarios. Tiresias makes these predictions by collecting observed latencies and access histories to build predictive models in an online manner, enabling autonomous storage and index adaptation. Experimental evaluation shows the benefits of predictive adaptation and the trade-offs of different predictive techniques.
CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm [Download Paper] Jiaoyi Zhang (Tsinghua University), Yihan Gao (Tsinghua University)* Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, making them far from practical indexes. In this paper, we identify that ignoring the importance of data partitioning during model training is the main reason for this problem. Thus, we explicitly apply data partitioning to index construction and propose a new efficient and updatable cache-aware RMI framework, called CARMI. Specifically, we introduce entropy as a metric to quantify and characterize the effectiveness of data partitioning of tree nodes in learned indexes and propose a novel cost model, laying a new theoretical foundation for future research. Then, based on our novel cost model, CARMI can automatically determine tree structures and model types under various datasets and workloads by a hybrid construction algorithm without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, a new cache-aware design is also applied in CARMI, which makes full use of the characteristics of the CPU cache to effectively reduce the number of memory accesses. Our experimental study shows that CARMI performs better than baselines, achieving an average of 2.2×/1.9× speedup compared to B+ Tree/ALEX, while using only about 0.77× the memory space of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2× over the nearest competitor RMI, which has been carefully tuned for each dataset in advance.
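For readers unfamiliar with RMI-style learned indexes, the sketch below builds a two-stage index: a root linear model routes keys to leaf linear models, and a bounded local search corrects the leaf model's prediction error. It is a generic illustration of the RMI idea under simplifying assumptions, not CARMI's cost-based construction or cache-aware layout.

    import bisect
    import numpy as np

    class TwoStageRMI:
        # Minimal recursive-model-index sketch for sorted numeric keys.
        def __init__(self, keys, fanout=16):
            self.keys = np.sort(np.asarray(keys, dtype=float))
            n = len(self.keys)
            self.fanout = fanout
            # Root model: linear fit from key to position, later scaled into [0, fanout).
            self.a, self.b = np.polyfit(self.keys, np.arange(n), 1)
            buckets = [[] for _ in range(fanout)]
            for pos, k in enumerate(self.keys):
                buckets[self._route(k)].append((k, pos))
            self.leaves = []
            for bucket in buckets:
                if bucket:
                    ks, ps = zip(*bucket)
                    a, b = np.polyfit(ks, ps, 1) if len(ks) > 1 else (0.0, float(ps[0]))
                    err = int(max(abs(a * k + b - p) for k, p in bucket)) + 1
                else:
                    a, b, err = 0.0, 0.0, 1
                self.leaves.append((a, b, err))

        def _route(self, key):
            n = len(self.keys)
            return min(self.fanout - 1, max(0, int((self.a * key + self.b) / n * self.fanout)))

        def lookup(self, key):
            a, b, err = self.leaves[self._route(key)]
            guess = int(a * key + b)
            lo, hi = max(0, guess - err), min(len(self.keys), guess + err + 1)
            # Bounded local search within the leaf model's worst-case error window.
            i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
            return i if i < len(self.keys) and self.keys[i] == key else -1

    keys = np.sort(np.random.uniform(0, 1e6, 10_000))
    idx = TwoStageRMI(keys)
    assert keys[idx.lookup(keys[1234])] == keys[1234]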
Towards Distribution-aware Query Answering in Data Markets [Download Paper] [Vision] Abolfazl Asudeh (University of Illinois at Chicago)*, Fatemeh Nargesian (University of Rochester) Addressing the increasing demand for data exchange has led to the development of data markets that facilitate transactional interactions between data buyers and data sellers. Still, cost-effective and distribution-aware query answering is a substantial challenge in these environments. In this paper, while differentiating between different types of data markets, we take the initial steps towards addressing this challenge. In particular, we envision a unified query answering framework and discuss its functionalities. Our framework enables integrating data from different sources in a data market into a dataset that meets user-provided schema and distribution requirements cost-effectively. In order to facilitate consumers' query answering, our system discovers data views in the form of join-paths on relevant data sources, defines a get-next operation to query views, and estimates the cost of get-next on each view. The query answering engine then selects the next views to sample sequentially to collect the output data. Depending on the knowledge of the system from the underlying data sources, the view selection problem can be modeled as an instance of a multi-arm bandit or coupon collector's problem.
Rearchitecting In-Memory Object Stores for Low Latency [Download Paper] Danyang Zhuo (Duke University)*, Kaiyuan Zhang (University of Washington), Zhuohan Li (UC Berkeley), Siyuan Zhuang (UC Berkeley), Stephanie Wang (UC Berkeley), Ang Chen (Rice University), Ion Stoica (UC Berkeley) Low latency is increasingly critical for modern workloads, to the extent that compute functions are explicitly scheduled to be co-located with their in-memory object stores for faster access. However, the traditional object store architecture mandates that clients interact with the server via inter-process communication (IPC). This poses a significant performance bottleneck for low-latency workloads. Meanwhile, in many important emerging AI workloads, such as parallel tree search and reinforcement learning, all the worker processes accessing the object store belong to a single user. We design Lightning, an in-memory object store rearchitected for modern, low-latency workloads in a single-user, multi-process setting. Lightning departs from the traditional design by adopting a shared memory model, enabling clients to directly access the object store without crossing the IPC boundary. Instead, client isolation is achieved by a novel integration of Intel Memory Protection Keys (MPK) hardware, transaction logging, and formal verification. Our evaluations show that Lightning outperforms state-of-the-art in-memory object stores by up to 9.0x on five standard NoSQL workloads and up to 4.5x in scaling up a Python tree search program. Lightning improves the throughput of a popular reinforcement learning framework that uses an in-memory object store for data sharing by up to 40%.
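The following toy sketch illustrates the shared-memory idea in Python: co-located processes attach to one shared segment by name and read objects without an IPC round trip. It omits Lightning's MPK-based isolation, logging, and verification; the class, directory layout, and offsets are invented purely for illustration.

    from multiprocessing import shared_memory
    import json
    import struct

    class SharedObjectStore:
        HEADER = 8        # bytes reserved for the serialized directory length
        DATA_START = 4096 # payloads start here; the directory lives before this offset

        def __init__(self, name=None, size=1 << 20):
            # Create a new segment, or attach to an existing one by name.
            self.shm = (shared_memory.SharedMemory(name=name) if name
                        else shared_memory.SharedMemory(create=True, size=size))
            if name is None:
                self._write_directory({})

        def _write_directory(self, directory):
            blob = json.dumps(directory).encode()
            struct.pack_into("<Q", self.shm.buf, 0, len(blob))
            self.shm.buf[self.HEADER:self.HEADER + len(blob)] = blob

        def _read_directory(self):
            (n,) = struct.unpack_from("<Q", self.shm.buf, 0)
            return json.loads(bytes(self.shm.buf[self.HEADER:self.HEADER + n]))

        def put(self, key, payload: bytes):
            directory = self._read_directory()
            offset = max([end for _, end in directory.values()], default=self.DATA_START)
            self.shm.buf[offset:offset + len(payload)] = payload
            directory[key] = (offset, offset + len(payload))
            self._write_directory(directory)

        def get(self, key) -> bytes:
            start, end = self._read_directory()[key]
            return bytes(self.shm.buf[start:end])

    store = SharedObjectStore()
    store.put("model", b"weights...")
    reader = SharedObjectStore(name=store.shm.name)   # e.g., opened by another process
    print(reader.get("model"))
    reader.shm.close(); store.shm.close(); store.shm.unlink()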
DARLING: Data-Aware Load Shedding in Complex Event Processing Systems [Download Paper] Koral Chapnik (Technion)*, Ilya Kolchinsky (Technion), Assaf Schuster (Technion) Complex event processing (CEP) is widely employed to detect user-defined combinations, or patterns, of events in massive streams of incoming data. Numerous applications such as healthcare, fraud detection, and more, use CEP technologies to capture critical alerts, threats, or vital notifications. This requires that the technology meet real-time detection constraints. Multiple optimization techniques have been developed to minimize the processing time for CEP, including parallelization techniques, pattern rewriting, and more. However, these techniques may not suffice or may not be applicable when an unpredictable peak in the input event stream exceeds the system capacity. In such cases, one immediate possible solution is to drop some of the load in a technique known as load shedding. We present a novel load shedding mechanism for real-time complex event processing. Our approach uses statistics that are gathered to detect overload. The solution makes data-driven load shedding decisions to drop the less important events such that we preserve a given latency bound while minimizing the degradation in the quality of results. An extensive experimental evaluation on a broad set of real-life patterns and datasets demonstrates the superiority of our approach over the state-of-the-art techniques.
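A minimal sketch of utility-based load shedding: keep only as many buffered events as can be processed within the latency bound and shed the least useful ones under overload. The utility scores stand in for DARLING's data-driven importance estimates; the class below is purely illustrative.

    import heapq
    import itertools

    class LoadShedder:
        # Keeps at most max_backlog buffered events (the number the engine can
        # process within its latency bound) and sheds the lowest-utility ones.
        def __init__(self, max_backlog):
            self.max_backlog = max_backlog
            self._counter = itertools.count()
            self._heap = []                 # min-heap of (utility, seq, event)

        def offer(self, event, utility):
            entry = (utility, next(self._counter), event)
            if len(self._heap) < self.max_backlog:
                heapq.heappush(self._heap, entry)
                return True                 # admitted
            if utility > self._heap[0][0]:  # overload: evict the least useful event
                heapq.heapreplace(self._heap, entry)
                return True
            return False                    # shed

    # Toy usage: admit a 3-event backlog and shed anything less useful than it.
    shedder = LoadShedder(max_backlog=3)
    for i, u in enumerate([0.9, 0.1, 0.5, 0.05, 0.7]):
        shedder.offer({"id": i}, utility=u)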
CORE: a Complex Event Recognition Engine [Download Paper] Marco Bucchi (PUC Chile), Alejandro Grez (PUC Chile), Andres F Quintana (PUC), Cristian Riveros (PUC Chile)*, Stijn Vansummeren (Hasselt University) Complex Event Recognition (CER) systems are a prominent technology for finding user-defined query patterns over large data streams in real time. CER query evaluation is known to be computationally challenging, since it requires maintaining a set of partial matches, and this set quickly grows super-linearly in the number of processed events. We present CORE, a novel COmplex event Recognition Engine that focuses on the efficient evaluation of a large class of complex event queries, including time windows as well as the partition-by event correlation operator. This engine uses a novel evaluation algorithm that circumvents the super-linear partial match problem: under data complexity, it takes constant time per input event to maintain a data structure that compactly represents the set of partial matches and, once a match is found, the query results may be enumerated from the data structure with output-linear delay. We experimentally compare CORE against state-of-the-art CER systems on real-world data. We show that (1) CORE's performance is stable with respect to both query and time window size, and (2) CORE outperforms the other systems by up to five orders of magnitude on different workloads.
Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments [Download Paper] Huan Li (Aalborg University)*, Lanjing Yi (Southern University of Science and Technology), Bo Tang (Southern University of Science and Technology), Hua Lu (Roskilde University), Christian S Jensen (Aalborg University) Underlying many types of data analytics, a spatiotemporal quantile monitoring (SQM) query continuously returns the quantiles of a dataset observed in a spatiotemporal range. In this paper, we study SQM in an Internet of Things (IoT) based edge computing environment, where concurrent SQM queries share the same infrastructure asynchronously. To minimize query latency while providing result accuracy guarantees, we design a processing framework that virtualizes edge-resident data sketches for quantile computing. In the framework, a coordinator edge node manages edge sketches and synchronizes edge sketch processing and query executions. The coordinator also controls the processed data fractions of edge sketches, which helps to achieve the optimal latency with error-bounded results for each single query. To support concurrent queries, we employ a grid to decompose queries into subqueries and process them efficiently using shared edge sketches. We also devise a relaxation algorithm to converge to optimal latencies for those subqueries whose result errors are still bounded. We evaluate our proposals using two high-speed streaming datasets in a simulated IoT setting with edge nodes. The results show that our proposals achieve efficient, scalable, and error-bounded SQM.
Hardware Acceleration of Compression and Encryption in SAP HANA [Download Paper] [Best Industry Paper] Monica Chiosa (ETH Zürich)*, Fabio Maschi (ETHZ), Ingo Müller (Google), Gustavo Alonso (ETHZ), Norman May (SAP SE) With the advent of cloud computing, where computational resources are expensive and data movement needs to be secured and minimized, database management systems need to reconsider their architecture to accommodate such requirements. In this paper, we present our analysis, design, and evaluation of an FPGA-based hardware accelerator for offloading compression and encryption for SAP HANA, SAP's Software-as-a-Service (SaaS) in-memory database. First, we identify expensive data-transformation operations in the I/O path. Then we present the design details of a system consisting of compression followed by different types of encryption to accommodate different security levels, and identify which combinations maximize performance. We also analyze the performance benefits of offloading decryption to the FPGA followed by decompression on the CPU. The experimental evaluation using SAP HANA traces shows that analytical engines can benefit from FPGA hardware offloading. The results identify a number of important trade-offs (e.g., the system can serve low-latency secured transactions as well as high-performance use cases, or offer lower storage cost by also compressing payloads for less critical use cases), and provide valuable information to researchers and practitioners exploring the nascent space of hardware accelerators for database engines.
Troubles with Nulls, Views from the Users [Download Paper] Etienne Jr Toussaint (University of Edinburgh), Paolo Guagliardo (University of Edinburgh)*, Leonid Libkin (University of Edinburgh, School of Informatics), Juan Sequeda (data.world) Incomplete data, in the form of null values, has been extensively studied since the inception of the relational model in the 1970s. Anecdotally, one hears that the way in which SQL, the standard language for relational databases, handles nulls creates a myriad of problems in everyday applications of database systems. To the best of our knowledge, however, the actual shortcomings of SQL in this respect, as perceived by database practitioners, have not been systematically documented, and it is not known whether existing research results can readily be used to address the practical challenges. Our goal is to collect and analyze the shortcomings of nulls and their treatment by SQL, and to re-evaluate existing research in this light. To this end, we designed and conducted a survey on the everyday usage of null values among database users. From the analysis of the results we reached two main conclusions. First, null values are ubiquitous and relevant in real-life scenarios, but SQL's features designed to deal with them cause multiple problems; the severity of these problems varies depending on the SQL features used, and they cannot be reduced to a single issue. Second, foundational research on nulls is misdirected and has been addressing problems of limited practical relevance. We urge the community to view the results of this survey as a way to broaden the spectrum of their research and to further bridge the theory-practice gap on null values.
Rewriting the Infinite Chase [Download Paper] Michael Benedikt (Oxford University)*, Maxime Buron (Inria), Stefano Germano (University of Oxford), Kevin Kappelmann (Technical University of Munich), Boris Motik (University of Oxford) Guarded tuple-generating dependencies (GTGDs) are a natural extension of description logics and referential constraints. It has long been known that queries over GTGDs can be answered by a variant of the chase---a quintessential technique for reasoning with dependencies. However, there has been little work on concrete algorithms and even less on implementation. To address this gap, we revisit Datalog rewriting approaches to query answering, where GTGDs are transformed to a Datalog program that entails the same base facts on each base instance. We show that the rewriting can be seen as containing "shortcut" rules that circumvent certain chase steps, we present several algorithms that compute the rewriting by simulating specific types of chase steps, and we discuss important implementation issues. Finally, we show empirically that our techniques can process complex GTGDs derived from synthetic and real benchmarks and are thus suitable for practical use.
MT-Teql: Evaluating and Augmenting Neural NLIDB on Real-world Linguistic and Schema Variations [Download Paper] [Experiments, Analysis & Benchmark] Pingchuan Ma (HKUST)*, Shuai Wang (HKUST) Natural Language Interface to Database (NLIDB) translates human utterances into SQL queries and enables database interactions for non-expert users. Recently, neural network models have become a major approach to implementing NLIDB. However, neural NLIDB faces challenges due to variations in natural language and database schema design. For instance, one user intent or database conceptual model can be expressed in various forms. Existing benchmarks, which use hold-out datasets, cannot provide a thorough understanding of how well neural NLIDBs really perform in real-world situations or of their robustness against such variations. A key difficulty is annotating SQL queries for inputs under real-world variations, which requires considerable manual effort and expert knowledge. To systematically assess the robustness of neural NLIDBs without extensive manual effort, we propose MT-Teql, a unified framework to benchmark NLIDBs against real-world language and schema variations. Inspired by recent advances in DBMS metamorphic testing, MT-Teql implements semantics-preserving transformations on utterances and database schemas to generate their variants. NLIDBs can thus be examined for robustness using utterances/schemas and their variants without requiring manual intervention. We benchmarked nine neural NLIDB models using a total of 62,430 test inputs. MT-Teql successfully identified 15,433 defects. We categorized the errors exposed by MT-Teql and analyzed potential root causes of inconsistencies. We further conducted a user study to show how MT-Teql can assist developers to systematically assess NLIDBs. We show that the transformed (error-triggering) inputs can be leveraged to augment neural NLIDBs and successfully eliminate 46.5% (±5.0%) of the errors made by popular neural NLIDBs without compromising their accuracy on standard benchmarks. We summarize lessons from the study that can provide insights for selecting and designing neural NLIDBs that fit particular usage scenarios.
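To illustrate the metamorphic-testing idea, the sketch below generates semantics-preserving schema variants (here, simple column permutations) and checks that a model under test predicts equivalent SQL for all of them. The transformation, the fake model, and the normalizer are placeholders, not MT-Teql's actual operators.

    import itertools

    def schema_variants(schema):
        # Yield semantics-preserving variants by permuting column order; a robust
        # NLIDB should predict equivalent SQL for every variant.
        table, columns = next(iter(schema.items()))
        for perm in itertools.permutations(columns):
            yield {table: list(perm)}

    def check_consistency(nlidb_predict, utterance, schema, normalize):
        # Metamorphic check: the (normalized) predicted SQL must be identical
        # across all variants. nlidb_predict and normalize are placeholders for
        # the model under test and a SQL canonicalizer.
        outputs = {normalize(nlidb_predict(utterance, v)) for v in schema_variants(schema)}
        return len(outputs) == 1

    # Toy usage with a fake model that is (incorrectly) sensitive to column order.
    fake_model = lambda utt, sch: "SELECT " + next(iter(sch.values()))[0] + " FROM singer"
    print(check_consistency(fake_model, "show singer names",
                            {"singer": ["name", "age", "country"]}, normalize=str.strip))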
AcX: System, Techniques, and Experiments for Acronym Expansion [Download Paper] João L. M. Pereira (INESC-ID and IST, Universidade de Lisboa, and University of Amsterdam)*, João Casanova (Hitachi Vantara), Helena Galhardas (INESC-ID and IST, Universidade de Lisboa), and Dennis Shasha (NYU, USA) In this information-accumulating world, each of us must learn continuously. To participate in a new field, or even a sub-field, one must be aware of the terminology, including the acronyms that specialists know so well but newcomers do not. Building on state-of-the-art acronym tools, our end-to-end acronym expander system, called AcX, takes a document, identifies its acronyms, and suggests expansions that are either found in the document or appropriate given the subject matter of the document. As far as we know, AcX is the first open-source and extensible system for acronym expansion that allows mixing and matching of different inference modules. As of now, AcX works for English, French, and Portuguese, with other languages in progress. This paper describes the design and implementation of AcX, proposes three new acronym expansion benchmarks, compares state-of-the-art techniques on them, and proposes ensemble techniques that improve on any single technique. Finally, the paper evaluates the performance of AcX in end-to-end experiments on a human-annotated dataset of Wikipedia documents. Our experiments show that human performance is still better than the best automated approaches; achieving acronym expansion at a human level thus remains a rich and open challenge.
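A small illustration of in-document acronym expansion: detect acronyms introduced as "Long Form (LF)" and propose the preceding words whose initials spell the acronym. This is only the simplest of the inference modules such a system might mix and match; the code is a hypothetical sketch.

    import re

    def in_document_expansions(text):
        # Find patterns like "complex event processing (CEP)" and propose the
        # preceding words whose initials spell the acronym.
        results = {}
        for match in re.finditer(r"\(([A-Z]{2,})\)", text):
            acronym = match.group(1)
            words = re.findall(r"[A-Za-z][\w-]*", text[:match.start()])
            k = len(acronym)
            candidate = words[-k:]
            if len(candidate) == k and all(w[0].upper() == c for w, c in zip(candidate, acronym)):
                results[acronym] = " ".join(candidate)
        return results

    print(in_document_expansions(
        "We study complex event processing (CEP) and support vector machines (SVM)."))
    # {'CEP': 'complex event processing', 'SVM': 'support vector machines'}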
CDI-E: An Elastic Cloud Service for Data Engineering [Download Paper] [Industry] Prakash C Das (Informatica), Shivangi Srivastava (Informatica), Valentin Moskovich (Informatica)*, Anmol Chaturvedi (Informatica), Anant Mittal (Informatica), Yongqin Xiao (Informatica), Mosharaf Chowdhury (University of Michigan, Ann Arbor) We live in the gilded age of data-driven computing. With public clouds offering a virtually unlimited amount of compute and storage, enterprises collecting data about every aspect of their businesses, and advances in analytics and machine learning technologies, data-driven decision making is now timely, cost-effective, and therefore pervasive. Alas, only a handful of power users can wield today's powerful data engineering tools. For one thing, most solutions require knowledge of specific programming interfaces or libraries. Furthermore, running them requires complex configurations and knowledge of the underlying cloud for cost-effectiveness. We decided that a fundamental redesign was in order to democratize data engineering for the masses at cloud scale. The result is Informatica Cloud Data Integration - Elastic (CDI-E). Since the early 1990s, Informatica has been a pioneer and industry leader in building no-code data engineering tools, with which non-experts can express complex data engineering tasks using a graphical user interface (GUI). Informatica CDI-E is built to combine the simplicity of a GUI in the design layer with an elastic and highly scalable runtime that handles data in any format with little to no user input, using automated optimizations. Users upload their data to the cloud in any format and can immediately use it in conjunction with their data management and analytics tools of choice through the CDI-E GUI. Implementation began in the spring of 2017, and Informatica CDI-E has been generally available since the summer of 2019. Today, CDI-E is used in production by a growing number of small and large enterprises to make sense of data in arbitrary formats. In this paper, we describe the architecture of Informatica CDI-E and its novel no-code data engineering interface. The paper highlights some of the key features of CDI-E: simplicity without loss of productivity, and extreme elasticity. It concludes with the lessons we learned and an outlook on the future.
An Experimental Evaluation and Investigation of Waves of Misery in R-trees [Download Paper] Lu Xing (Purdue University), Walid G Aref (Purdue)*, Jianguo Wang (Purdue University), Bo-cheng Chu (Purdue), Tong An (Purdue), Eric Lee (), Ahmed Aly (Facebook), Ahmed Mahmood (Purdue University) Waves of misery is a phenomenon where spikes of many node splits occur over short periods of time in tree indexes. Waves of misery negatively affect the performance of tree indexes in insertion-heavy workloads. Waves of misery were first observed in the context of the B-tree, where these waves cause unpredictable index performance. In particular, the performance of search and index-update operations deteriorates when a wave of misery takes place, but is more predictable between the waves. This paper investigates the presence or absence of waves of misery in several R-tree variants, and studies the extent to which these waves impact the performance of each variant. Interestingly, although they have poorer query performance, the Linear and Quadratic R-trees are found to be more resilient to waves of misery than both the Hilbert and R*-trees. This paper presents several techniques to reduce the performance impact of waves of misery for the Hilbert and R*-trees. One way to eliminate waves of misery is to force node splits to take place at regular times before nodes become full, to achieve deterministic performance. The other is, upon splitting a node, not to split it evenly but rather at varying node utilization factors, so that leaf nodes do not all fill at the same pace. We study the impact of two new techniques to mitigate waves of misery after the tree index has been constructed, namely Regular Elective Splits (RES, for short) and Unequal Random Splits (URS, for short). Our experimental investigation highlights the performance trade-offs of the introduced techniques and the pros and cons of each technique.
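As a quick illustration of the unequal-split idea, the sketch below splits an overflowing node at a randomly chosen utilization factor instead of 50/50, so sibling nodes reach capacity and split again at different times; the pre-sorted entries, axis choice, and exact bounds are simplifying assumptions rather than the paper's URS procedure.

    import random

    def split_entries(entries, min_util=0.4, max_util=0.6):
        # Split an overflowing node's entries (assumed pre-sorted along the chosen
        # split axis) at a randomly chosen utilization factor instead of 50/50.
        frac = random.uniform(min_util, max_util)
        cut = max(1, min(len(entries) - 1, int(len(entries) * frac)))
        return entries[:cut], entries[cut:]

    left, right = split_entries(list(range(100)))
    print(len(left), len(right))   # e.g. 43 57: siblings will overflow at different times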
Frost: A Platform for Benchmarking and Exploring Data Matching Results [Download Paper] [Industry] Martin Graf (Hasso Plattner Institute), Lukas Laskowski (Hasso Plattner Institute), Florian Papsdorf (Hasso Plattner Institute), Florian Sold (Hasso Plattner Institute)*, Roland Gremmelspacher (SAP SE), Felix Naumann (Hasso Plattner Institute), Fabian Panse (Universität Hamburg) "Bad" data has a direct impact on 88% of companies, with the average company losing 12% of its revenue due to it. Duplicates, i.e., multiple but different representations of the same real-world entities, are among the main reasons for poor data quality, so finding and configuring the right deduplication solution is essential. Existing data matching benchmarks focus on the quality of matching results and neglect other important factors, such as business requirements. Additionally, they often do not support the exploration of data matching results. To address this gap between the mere counting of record pairs and a comprehensive means of evaluating data matching solutions, we present the Frost platform. It combines existing benchmarks, established quality metrics, cost and effort metrics, and exploration techniques, making it the first platform to allow systematic exploration for understanding matching results. Frost is implemented and published in the open-source application Snowman, which includes the visual exploration of matching results, as shown in Figure 1.
On Shapley Value in Data Assemblage Under Independent Utility [Download Paper] Xuan Luo (Simon Fraser University)*, Jian Pei (Simon Fraser University), Zicun Cong (Simon Fraser University), Cheng Xu (Simon Fraser University) In many applications, an organization may want to acquire data from many data owners. Data marketplaces allow data owners to produce, through coalition, the data assemblages needed by data buyers. To encourage coalitions to produce data, it is critical to allocate revenue to data owners in a fair manner according to their contributions. Although Shapley fairness and its alternatives have been well explored in the literature to facilitate revenue allocation in data assemblage, computing the exact Shapley value for many data owners and large assembled data sets produced through coalition remains challenging due to the combinatoric nature of the Shapley value. In this paper, we explore the decomposability of utility in data assemblage by formulating the independent utility assumption. We argue that independent utility enjoys many applications. Moreover, we identify interesting properties of independent utility and develop fast computation techniques for the exact Shapley value under independent utility. Our experimental results on a series of benchmark data sets show that our new approach not only guarantees the exactness of the Shapley value, but also achieves faster computation by orders of magnitude.
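The sketch below illustrates why independent utility helps: the Shapley value is linear in the utility function, so when the assembled data set's utility decomposes into per-tuple utilities, the owners' Shapley values decompose the same way. The brute-force permutation computation and the toy owner/tuple data are purely illustrative, not the paper's fast algorithms.

    from itertools import permutations

    def shapley(owners, utility):
        # Exact Shapley values by enumerating owner permutations (exponential;
        # fine only for this toy illustration).
        phi = {o: 0.0 for o in owners}
        perms = list(permutations(owners))
        for perm in perms:
            coalition = set()
            for o in perm:
                before = utility(frozenset(coalition))
                coalition.add(o)
                phi[o] += utility(frozenset(coalition)) - before
        return {o: v / len(perms) for o, v in phi.items()}

    # Independent utility: the assembled data set's value is a sum of per-tuple
    # utilities, each depending only on which owners can supply that tuple.
    tuple_owners = {"t1": {"A"}, "t2": {"A", "B"}, "t3": {"B", "C"}}
    tuple_value = {"t1": 6.0, "t2": 4.0, "t3": 2.0}

    def u_total(S):
        return sum(v for t, v in tuple_value.items() if tuple_owners[t] & S)

    def u_tuple(t):
        return lambda S: tuple_value[t] if tuple_owners[t] & S else 0.0

    owners = ["A", "B", "C"]
    total = shapley(owners, u_total)
    per_tuple_sum = {o: sum(shapley(owners, u_tuple(t))[o] for t in tuple_value) for o in owners}
    print(total)           # equal to per_tuple_sum: Shapley decomposes over
    print(per_tuple_sum)   # independent tuples, which enables fast exact computation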
In-Page Shadowing and Two-Version Timestamp Ordering for Mobile DBMSs [Download Paper] Duy Lam Nguyen (Sungkyunkwan University), Sang Won Lee (Sungkyunkwan University), Beomseok Nam (Sungkyunkwan University)* Increasing the concurrency level in mobile database systems has not received much attention, mainly because the concurrency requirements of mobile workloads have been regarded as low. Contrary to this popular belief, mobile workloads do require higher concurrency. In this work, we propose novel journaling and concurrency mechanisms for mobile DBMSs, both of which build upon one common concept: In-Page Shadowing (IPS). We design and implement a novel In-Page Shadowing recovery method for SQLite to resolve the "journaling of journal" anomaly, which is known to quadruple the I/O traffic in mobile devices. IPS unions the previous and the next versions of a database page in the same physical page. Using the two consolidated versions of a database page, we design a Two-Version Timestamp-Ordering (2VTO) protocol that enables non-blocking reads as in multi-version concurrency control, but reduces the garbage collection overhead. Designed with mobile environments in mind, IPS and 2VTO are highly performant and resource-efficient transactional solutions. Our performance study shows that IPS and 2VTO outperform state-of-the-art logging methods and an optimistic concurrency control protocol on real mobile workloads.
Threshold Queries in Theory and in the Wild [Download Paper] [Best Regular Research Paper Runner Ups] Angela Bonifati (Univ. of Lyon), Stefania Dumbrava (ENSIIE), George Fletcher (Eindhoven University of Technology), Jan Hidders (University of London, Birbeck), Matthias Hofer (University of Bayreuth), Wim Martens (University of Bayreuth), Filip Murlak (University of Warsaw, Poland), Joshua Shinavier (Uber), Sławek Staworko (University of Lille)*, Dominik Tomaszuk (University of Bialystok) Threshold queries are an important class of queries that only require computing or counting answers up to a specified threshold value. To the best of our knowledge, threshold queries have been largely disregarded in the research literature, which is surprising considering how common they are in practice. In this paper, we present a deep theoretical analysis of threshold query evaluation and show that thresholds can be used to significantly improve the asymptotic bounds of state-of-the-art query evaluation algorithms. We also empirically show that threshold queries are significant in practice. In surprising contrast to conventional wisdom, we found important scenarios in real-world data sets in which users are interested in computing the results of queries up to a certain threshold, independent of a ranking function that orders the query results by importance.
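A minimal example of the threshold-query idea: when a user only needs to know whether at least k answers exist (or wants at most k of them), the scan can terminate early instead of computing the exact count. The function below is an illustrative sketch, not an algorithm from the paper.

    def count_up_to(rows, predicate, threshold):
        # Count matching rows but stop as soon as `threshold` matches are found.
        count = 0
        for row in rows:
            if predicate(row):
                count += 1
                if count >= threshold:
                    break
        return count

    rows = ({"city": "Paris" if i % 3 == 0 else "Rome"} for i in range(10**6))
    print(count_up_to(rows, lambda r: r["city"] == "Paris", threshold=100))  # 100, after ~300 rows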
PerMA-Bench: Benchmarking Persistent Memory Access [Download Paper] [Experiments, Analyses & Benchmarks] Lawrence Benson (Hasso Plattner Institute, University of Potsdam)*, Leon Papke (Hasso Plattner Institute), Tilmann Rabl (HPI, University of Potsdam) Persistent memory's (PMem) byte-addressability and persistence at DRAM-like speed with SSD-like capacity have the potential to cause a major performance shift in database storage systems. With the availability of Intel Optane DC Persistent Memory, initial benchmarks evaluate the performance of real PMem hardware. However, these results apply to only a single server and it is not yet clear how workloads compare across different PMem servers. In this paper, we propose PerMA-Bench, a configurable benchmark framework that allows users to evaluate the bandwidth, latency, and operations per second for customizable database-related PMem access. Based on PerMA-Bench, we perform an extensive evaluation of PMem performance across four different server configurations, containing both first- and second-generation Optane, with additional parameters such as DIMM power budget and number of DIMMs per server. We validate our results with existing systems and show the impact of low-level design choices. We conduct a price-performance comparison that shows while there are large differences across Optane DIMMs, PMem is generally competitive with DRAM. We discuss our findings and identify eight general and implementation-specific aspects that influence PMem performance and should be considered in future work to improve PMem-aware designs.
Endure: A Robust Tuning Paradigm for LSM Trees Under Workload Uncertainty [Download Paper] Andy Huynh (Boston University)*, Harshal Chaudhari (Boston University), Evimaria Terzi (Boston University), Manos Athanassoulis (Boston University) Log-Structured Merge trees (LSM trees) are increasingly used as the storage engines behind several data systems, frequently deployed in the cloud. Similar to other database architectures, LSM trees consider information about the expected workload (e.g., reads vs. writes, point vs. range queries) to optimize their performance via tuning. However, operating in a shared infrastructure like the cloud comes with workload uncertainty due to the fast-evolving nature of modern applications. Systems with static tuning discount the variability of such hybrid workloads and hence provide an inconsistent and overall suboptimal performance. To address this problem, we introduce ENDURE - a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policies, size-ratio, and memory allocation on the overall performance. ENDURE considers a robust formulation of the throughput maximization problem and recommends a tuning that maximizes the worst-case throughput over the neighborhood of each expected workload. Additionally, an uncertainty tuning parameter controls the size of this neighborhood, thereby allowing the output tunings to be conservative or optimistic. Through both model-based and extensive experimental evaluations of ENDURE in the state-of-the-art LSM-based storage engine, RocksDB, we show that the robust tuning methodology consistently outperforms classical tuning strategies. The robust tunings output by ENDURE lead up to a 5x improvement in throughput in the presence of uncertainty. On the flip side, ENDURE tunings have negligible performance loss when the observed workload exactly matches the expected one.
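The min-max flavor of robust tuning can be sketched as follows: for each candidate tuning, evaluate a cost model on every workload in a neighborhood of the expected workload and keep the tuning with the best worst case. The candidate set and the toy cost model below are invented placeholders for ENDURE's analytical LSM model.

    def robust_tuning(candidates, expected, cost_model, rho=0.2, step=0.05):
        # Maximize worst-case throughput over a neighborhood of write fractions
        # around the expected workload; rho controls the neighborhood size.
        def neighborhood(w):
            lo, hi = max(0.0, w - rho), min(1.0, w + rho)
            w_ = lo
            while w_ <= hi + 1e-9:
                yield w_
                w_ += step
        best, best_worst = None, float("-inf")
        for tuning in candidates:
            worst = min(cost_model(tuning, w) for w in neighborhood(expected))
            if worst > best_worst:
                best, best_worst = tuning, worst
        return best, best_worst

    # Toy usage: candidates are (compaction policy, size ratio) pairs and the
    # cost model crudely favors leveling for reads and tiering for writes.
    candidates = [("leveling", T) for T in (4, 8, 16)] + [("tiering", T) for T in (4, 8, 16)]
    def toy_model(tuning, write_fraction):
        policy, size_ratio = tuning
        read_tput = 100.0 / (size_ratio if policy == "leveling" else 2 * size_ratio)
        write_tput = 50.0 if policy == "tiering" else 20.0
        return (1 - write_fraction) * read_tput + write_fraction * write_tput
    print(robust_tuning(candidates, expected=0.5, cost_model=toy_model))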
Ganos: A Multidimensional, Dynamic, and Scene-Oriented Cloud-Native Spatial Database Engine [Download Paper] [Industry] Jiong Xie (Alibaba Group), Zhen chen (Alibaba Corp.), Jianwei Liu (alibaba), Fang Wang (Alibaba), Feifei Li (Alibaba Group), Zhida Chen (Alibaba Group)*, Yinpei Liu (Alibaba Group), Songlu Cai (Alibaba Group), Zhenhua Fan (Alibaba-inc), Fei Xiao (Alibaba Group), Yue Chen (Alibaba group) Recently, the trend of developing digital twins for smart cities has driven a need for managing large-scale multidimensional, dynamic, and scene-oriented spatial data. Due to larger data scale and more complex data structure, queries over such data are more complicated and expensive than those on traditional spatial data, which poses challenges to the system efficiency and deployment costs. The existing spatial databases have limited support in both data types and operations. Therefore, a new-generation spatial database with excellent performance and effective deployment costs is needed. This paper presents Ganos, a cloud-native spatial database engine of PolarDB for PostgreSQL that is developed by Alibaba Cloud, to efficiently manage multidimensional, dynamic, and scene-oriented spatial data. Ganos models 3D space and spatio-temporal dynamics as first-class citizens. Also, it natively supports spatial/spatio-temporal data types such as 3DMesh, Trajectory, Raster, PointCloud, etc. Besides, it implements a novel extended-storage mechanism that utilizes cloud-native object storage to reduce storage costs and enable uniform operations on the data in different storages. To facilitate processing "big" queries, Ganos extends PolarDB and provides spatial-oriented multi-level parallelism under the architecture of decoupling compute from storage in cloud-native databases, which achieves elasticity and excellent query performance. We demonstrate Ganos in real-life case studies. The performance of Ganos is evaluated using real datasets, and promising results are obtained. Finally, based on the extensive deployment and application of Ganos, the lessons learned from our customers and the expectations of modern cloud applications for new spatial database features are discussed.
VRE: A Versatile, Robust, and Economical Trajectory Data System [Download Paper] [Industry] Hai Lan (RMIT University), Jiong Xie (Alibaba Group), Zhifeng Bao (RMIT University)*, Feifei Li (Alibaba Group), Wei Tian (Alibaba Group), Fang Wang (Alibaba), Sheng Wang (Alibaba Group), Ailin Zhang (Alibaba) Managing massive trajectory data from various moving objects has always been a demanding task. A desired trajectory data system should be versatile in its supported query types and distance functions, have low storage cost, and be consistently efficient in processing trajectory data of different properties. Unfortunately, none of the existing systems can meet the above three criteria at the same time. To this end, we propose VRE, a versatile, robust, and economical trajectory data system. VRE separates storage from processing. In the storage layer, we propose a novel segment-based storage model that takes advantage of the strengths of both point-based and trajectory-based storage models. VRE supports these three storage models and ten storage schemas upon them. With a secondary index, VRE reduces the storage cost by up to 3x. In the processing layer, we first propose a two-stage processing framework and a pushdown strategy to alleviate the cost of transmitting full trajectories. Then, we design a unified pruning strategy for five widely used trajectory distance functions and numerous tailored processing algorithms for five advanced queries. Extensive experiments are conducted to verify the design choices and efficiency of VRE, from which we present some key insights that are crucial to the design of both VRE and future trajectory systems.
NFL: Robust Learned Index via Distribution Transformation [Download Paper] Shangyu Wu (City University of Hong Kong)*, Yufei Cui (City University of Hong Kong), Jinghuan Yu (City University of Hong Kong), Xuan Sun (City University of Hong Kong), Tei-wei Kuo (City University of Hong Kong), Chun Jason Xue (City University of Hong Kong) Recent work on learned indexes has opened a new direction for the indexing field. The key insight of the learned index is to approximate the mapping between keys and positions with piece-wise linear functions. Such methods require partitioning the key space for a better approximation. Although many heuristics have been proposed to improve the approximation quality, the bottleneck is that the segmentation overheads can hinder the overall performance. This paper tackles the approximation problem by applying a distribution transformation to the keys before constructing the learned index. A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, and then builds a learned index leveraging the transformed keys. For effective distribution transformation, we propose a Numerical Normalizing Flow (Numerical NF). Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI). To validate the performance, comprehensive evaluations are conducted on both synthetic and real-world workloads, which show that the proposed NFL produces the highest throughput and the lowest tail latency compared to the state-of-the-art learned indexes.
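A rough illustration of the distribution-transformation idea, using an empirical-CDF transform in place of the paper's learned normalizing flow: after mapping keys through an approximate CDF they become near-uniform, so even a single linear model predicts positions with error bounded by roughly one grid cell. All names and parameters below are illustrative assumptions.

    import numpy as np

    def fit_cdf_transform(keys, grid_size=256):
        # Approximate the key distribution's CDF by grid_size evenly spaced quantiles.
        return np.quantile(keys, np.linspace(0.0, 1.0, grid_size))

    def transform(keys, grid):
        # Map each key to its approximate CDF value in [0, 1].
        return np.interp(keys, grid, np.linspace(0.0, 1.0, len(grid)))

    keys = np.sort(np.random.lognormal(mean=0.0, sigma=2.0, size=100_000))  # heavily skewed
    u = transform(keys, fit_cdf_transform(keys))
    pred = u * (len(keys) - 1)          # a single linear model over the transformed keys
    # Position error is bounded by roughly one grid cell, i.e. about n / grid_size keys.
    print(np.max(np.abs(pred - np.arange(len(keys)))))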
WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model [Download Paper] [Scalable Data Science] Gyeong-in Yu (Seoul National University)*, Saeed Amizadeh (Microsoft), Sehoon Kim (University of California, Berkeley), Artidoro Pagnoni (Carnegie Mellon University), Ce Zhang (ETH), Byung-gon Chun (Seoul National University), Markus Weimer (Microsoft), Matteo Interlandi (Microsoft) While deep neural networks (DNNs) have been shown to be successful in several domains like computer vision, non-DNN models such as linear models and gradient boosting trees are still considered state-of-the-art over tabular data. When using these models, data scientists often author machine learning (ML) pipelines: DAGs of ML operators comprising data transforms and ML models, whereby each operator is trained sequentially, one at a time. Conversely, when training DNNs, the layers composing the neural networks are trained simultaneously using backpropagation. In this paper, we argue that the training scheme of ML pipelines is sub-optimal because it optimizes a single operator at a time and thus loses the chance for global optimization. We therefore propose WindTunnel: a system that translates a trained ML pipeline into a pipeline of neural network modules and jointly optimizes the modules using backpropagation. We also suggest translation methodologies for several non-differentiable operators, such as gradient boosting trees and categorical feature encoders. Our experiments show that fine-tuning the translated WindTunnel pipelines is a promising technique able to increase the final accuracy.
DeepEverest: Accelerating Declarative Top-K Queries for Deep Neural Network Interpretation [Download Paper] Dong He (University of Washington)*, Maureen Daum (University of Washington), Walter Cai (University of Washington), Magdalena Balazinska (UW) We design, implement, and evaluate DeepEverest, a system for the efficient execution of interpretation by example queries over the activation values of a deep neural network. DeepEverest consists of an efficient indexing technique and a query execution algorithm with various optimizations. We prove that the proposed query execution algorithm is instance optimal. Experiments with our prototype show that DeepEverest, using less than 20% of the storage of full materialization, significantly accelerates individual queries by up to 63x and consistently outperforms other methods on multi-query workloads that simulate DNN interpretation processes.
UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads [Download Paper] Arnab Phani (Graz University of Technology)*, Lukas Erlbacher (Graz University of Technology), Matthias Boehm (Graz University of Technology) Data science pipelines are typically exploratory. An integral task of such data science pipelines is feature transformation, which transforms raw data into numerical matrices or tensors for training or scoring. There exist a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead by static parallelization schemes and interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLIzing Feature Transformations. UPLIFT constructs a fine-grained task graph from a pipeline of transformations, optimizes the plan according to data characteristics, and executes it in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.
Guided Exploration of Data Summaries [Download Paper] Brit Youngmann (MIT)*, Sihem Amer-yahia (CNRS), Aurélien Personnaz (CNRS, Univ. Grenoble Alpes) Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed following a one-shot process with the purpose of finding the best summary. A useful summary contains a number of individually uniform sets that are collectively diverse, and hence representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such a summary is a difficult task when the data is large and highly diverse. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize Eda4Sum, the problem of guided exploration of data summaries, which seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. Eda4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum, which chooses the most useful summary at each step; and (ii) RLSum, which trains a policy with deep reinforcement learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.
On-Demand State Separation for Cloud Data Warehousing [Download Paper] Christian Winter (TUM)*, Jana Giceva (TU Munich), Thomas Neumann (TUM), Alfons Kemper (TUM) Moving data analysis and processing to the cloud is no longer reserved for a few companies with petabytes of data. Instead, the flexibility of on-demand resources is attracting an increasing number of customers with small to medium-sized workloads. These workloads do not occupy entire clusters but can run on single worker machines. However, picking the right worker for the job is challenging. Abstracting from worker machines, e.g., using stateless architectures, introduces overheads impacting performance. Solutions without stateless architectures resort to query restarts in the event of an adverse worker matching, wasting already achieved progress. In this paper, we propose migrating queries between workers by introducing on-demand state separation. Using state separation only when required enables maximum flexibility and performance while keeping already achieved progress. To derive the requirements for state separation, we first analyze the query state of medium-sized workloads on the example of TPC-DS SF100. Using this, we analyze the cost and describe the constraints necessary for state separation on such a workload. Furthermore, we describe the design and implementation of on-demand state separation in a compiling database system. Finally, using this implementation, we show the feasibility of our approach on TPC-DS and give a detailed analysis of the cost of query migration and state separation.
Netherite: Efficient Execution of Serverless Workflows [Download Paper] [Information System Architecture] Sebastian C Burckhardt (Microsoft Research)*, Badrish Chandramouli (Microsoft Research), Chris Gillum (Microsoft), David A Justo (Microsoft), Konstantinos Kallas (University of Pennsylvania), Connor Mcmahon (Microsoft), Christopher Meiklejohn (Carnegie Mellon University), Xiangfeng Zhu (University of Washington) Serverless is a popular choice for cloud service architects because it can provide scalability and load-based billing with minimal developer effort. Functions-as-a-Service (FaaS) offerings are originally stateless, but emerging frameworks add stateful abstractions. For instance, the widely used Durable Functions (DF) allow developers to write advanced serverless applications, including reliable workflows and actors, in a programming language of choice. DF implicitly and continuously persists the state and progress of applications, which greatly simplifies development but can create an IOps bottleneck. To improve efficiency, we introduce Netherite, a novel architecture for executing serverless workflows on an elastic cluster. Netherite groups the numerous application objects into a smaller number of partitions and pipelines the state persistence of each partition. This improves latency and throughput, as it enables workflow steps to group commit, even if causally dependent. Moreover, Netherite leverages FASTER's hybrid log approach to support larger-than-memory application state and to enable efficient partition movement between compute hosts. Our evaluation shows (a) that Netherite achieves lower latency and higher throughput than the original DF engine, by more than an order of magnitude in some cases, and (b) that Netherite has lower latency than some commonly used alternatives, like AWS Step Functions or cloud storage triggers.
Tenant Placement in Oversubscribed Database-as-a-Service Clusters [Download Paper] Arnd Christian König (Microsoft)*, Yi Shan (Microsoft), Tobias Ziegler (TU Darmstadt), Aarati Kakaraparthy (University of Wisconsin, Madison), Willis Lang (Microsoft), Justin Moeller (Microsoft), Ajay Kalhan (Microsoft), Vivek Narasayya (Microsoft) Relational cloud database-as-a-service offerings run on multi-tenant infrastructure consisting of clusters of nodes, with each node hosting multiple tenant databases. Such clusters may be over-subscribed to increase resource utilization and improve operational efficiency. When resources are over-subscribed, it becomes possible that a node has insufficient resources to satisfy the resource demands of all databases on it, making it necessary to move databases to other nodes in the cluster. Such moves can significantly impact database performance and availability. Therefore, it is important to avoid such resource shortages through judicious placement of databases on the cluster nodes. We propose a novel tenant placement approach that leverages historical traces of tenant resource demands to assess the likelihood of resource shortages. We have prototyped our techniques in the industrial-strength Service Fabric cluster manager. Experiments using production resource usage traces from Azure SQL DB and an evaluation on a real cluster deployment show significant improvements over state-of-the-art tenant placement techniques.
Redy: Remote Dynamic Memory Cache [Download Paper] Qizhen Zhang (University of Pennsylvania)*, Philip A Bernstein (Microsoft Research), Daniel S Berger (Microsoft Research), Badrish Chandramouli (Microsoft Research) Redy is a cloud service that provides high performance caches using RDMA-accessible remote memory. An application can customize the performance of each cache with a service level objective (SLO) for latency and throughput. By using remote memory, it can leverage stranded memory and spot VM instances to reduce the cost of its caches and improve data center resource utilization. Redy automatically customizes the resource configuration for the given SLO, handles the dynamics of remote memory regions, and recovers from failures. The experimental evaluation shows that Redy can deliver its promised performance and robustness under remote memory dynamics in the cloud. We augment a production key-value store, FASTER, with a Redy cache. When the working set exceeds local memory, using Redy is significantly faster than spilling to SSDs.
OceanBase: A 707 Million tpmC Distributed Relational Database System [Download Paper] [Industry] Zhenkun YANG (OceanBase), Chuanhui Yang (OceanBase), Fusheng Han (OceanBase), MingQiang Zhuang (OceanBase), Bing Yang (OceanBase), Zhifeng Yang (OceanBase), cheng xiaojun (oceanbase), Yuzhong Zhao (oceanbase), Wenhui Shi (OceanBase), huafeng xi (oceanbase.com), Huang Yu (Ant Financial Group), LIU BIN (OceanBase), Yi Pan (OceanBase), BOXUE YIN (OceanBase), Junquan Chen (OceanBase), Quanqing Xu (OceanBase)* We have designed and developed OceanBase, a distributed relational database system, from the ground up over the past decade. OceanBase is a scale-out, multi-tenant system built on a shared-nothing architecture and is fault tolerant across regions. Besides sharing many goals with alternative distributed DBMSs, such as horizontal scalability and fault tolerance, our design has been driven by the demands of typical RDBMS compatibility as well as both on-premise and off-premise deployments. OceanBase has fulfilled its design goals. It implements the salient features of mainstream classical RDBMSs, and most applications written for them can run on OceanBase with at most a few minor modifications. Tens of thousands of OceanBase servers have been deployed in Alipay.com as well as many other commercial organizations. OceanBase has also successfully passed the TPC-C benchmark test and took first place with more than 707 million tpmC. This paper presents the goals, design criteria, infrastructure, and key components of OceanBase, including its engines for storage and transaction processing. Further, it details how OceanBase achieves the above leading TPC-C result in a distributed cluster with more than 1,500 servers from 3 zones. It also describes lessons we have learned in building OceanBase over more than a decade.
Optimizing Differentially-Maintained Recursive Queries on Dynamic Graphs [Download Paper] Khaled Ammar (University of Waterloo, BorealisAI)*, Siddhartha Sahu (University of Waterloo), Semih Salihoglu (University of Waterloo), Tamer Özsu (University of Waterloo) Differential computation (DC) is a highly general incremental computation/view maintenance technique that can maintain the output of an arbitrary and possibly recursive dataflow computation upon changes to the base inputs of the dataflow. As such, it is a promising technique for graph database management systems (GDBMSs) that aim to support continuous recursive queries over dynamic graphs, such as single-pair shortest path, variable-length path, or regular path queries. Although differential computation can be highly efficient for maintaining these queries, it can require a prohibitively large amount of memory, as its generality is based on keeping track of all input and output differences of the operators in the dataflow across all iterations. This paper studies how to reduce the memory overheads of DC with the goal of increasing the scalability of systems that adopt it. We propose a suite of optimizations that are based on dropping the differences of operators, either completely or partially, and recomputing these differences when necessary. Our optimizations that drop parts of the differences of an operator require data structures to keep track of the dropped differences, for which we offer solutions using both deterministic and probabilistic data structures. We present extensive experiments evaluating the scalability and performance trade-offs of our optimizations and demonstrate that they can increase the scalability of a DC-based continuous query processor, implemented as an extension to the GraphflowDB GDBMS, by up to 20× while still providing better performance than rerunning the queries from scratch.
Efficient Maximal Biclique Enumeration for Large Sparse Bipartite Graphs [Download Paper] Lu Chen (Swinburne University of Technology)*, Chengfei Liu (Swinburne University of Technology), Rui Zhou (Swinburne University of Technology), Jiajie Xu (Soochow University), Jianxin Li (Deakin University) Maximal bicliques are effective in revealing meaningful information hidden in bipartite graphs. Maximal biclique enumeration (\textsc{MBE}) is challenging since the number of maximal bicliques grows exponentially w.r.t. the number of vertices in a bipartite graph in the worst case. However, a large bipartite graph is usually very sparse, which is far from the worst case and may lead to fast \textsc{MBE} algorithms. The uncharted opportunity is taking advantage of this sparsity to substantially improve \textsc{MBE} efficiency for large sparse bipartite graphs. We observe that for a large sparse bipartite graph, a vertex $v$ may converge to a few vertices in the same vertex set as $v$ via its neighbours, which reveals that the enumeration scope for a vertex could be very small. Based on this observation, we propose novel concepts: unilateral coreness for individual vertices, unilateral order for each vertex set and unilateral convergence ($\varsigma$) for a large sparse bipartite graph. $\varsigma$ could be a few thousand for a large sparse bipartite graph with hundreds of millions of edges. Using the unilateral order, every vertex with $\tau$ unilateral coreness only needs to check at most $2^{\tau}$ combinations so that all maximal bicliques can be enumerated, and $\tau$ is bounded by $\varsigma$, which leads to a novel \textsc{MBE} algorithm running in $\mathcal{O}^{*}(2^{\varsigma})$. We then propose a batch-pivots technique to eliminate all enumerations resulting in non-maximal bicliques, which guarantees that every maximal biclique is reported in $\mathcal{O}(\varsigma e)$-delay, where $e$ is the number of edges. We devise novel data structures that allow storing subgraphs in negligible space to further speed up \textsc{MBE}. Extensive experiments are conducted on synthetic and real large datasets to show that our proposed algorithm is faster and more scalable than existing algorithms.
Distributed Hop-Constrained s-t Simple Path Enumeration at Billion Scale [Download Paper] Kongzhang Hao (University of New South Wales)*, Long Yuan (Nanjing University of Science and Technology), Wenjie Zhang (University of New South Wales) Hop-constrained s-t simple path (HC-s-t Path) enumeration is a fundamental problem in graph analysis and has received considerable attention recently. Straightforward distributed solutions are inefficient and suffer from poor scalability when addressing this problem in billion-scale graphs, due to their inability to prune fruitless exploration or their huge memory consumption. Motivated by this, in this paper, we aim to devise an efficient and scalable distributed algorithm to enumerate HC-s-t paths in billion-scale graphs. We first propose a new hybrid search paradigm tailored for HC-s-t path enumeration. Based on the new search paradigm, we devise a distributed enumeration algorithm following the divide-and-conquer strategy. The algorithm not only prunes fruitless exploration, but also bounds memory consumption well while fully utilizing computation resources. We also devise an effective workload balance mechanism that is automatically triggered by idle machines to handle skewed workloads. Moreover, we explore a bidirectional search strategy to further improve enumeration efficiency. The experimental results demonstrate the efficiency of our proposed algorithm.
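The abstract does not spell out the paper's hybrid search paradigm, so as a point of reference only, here is a minimal single-machine baseline for the problem it targets: a depth-first enumeration of hop-constrained s-t simple paths. The graph, vertex names, and hop budget below are illustrative assumptions, not from the paper.

```python
from typing import Dict, List

def hc_st_paths(adj: Dict[str, List[str]], s: str, t: str, k: int) -> List[List[str]]:
    """Enumerate all simple paths from s to t with at most k hops (naive DFS baseline)."""
    results, path, visited = [], [s], {s}

    def dfs(u: str) -> None:
        if u == t:
            results.append(list(path))
            return
        if len(path) - 1 == k:      # hop budget exhausted
            return
        for v in adj.get(u, []):
            if v not in visited:    # keep the path simple
                visited.add(v)
                path.append(v)
                dfs(v)
                path.pop()
                visited.remove(v)

    dfs(s)
    return results

# Toy usage: all paths from "a" to "d" with at most 3 hops.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["b", "d"]}
print(hc_st_paths(graph, "a", "d", 3))
```

The paper's contribution lies precisely in what this baseline lacks: distributed execution, pruning of fruitless exploration, bounded memory, and workload balancing.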
Subgraph Matching over Graph Federation [Download Paper] Ye Yuan (Beijing Institute of Technology)*, Delong Ma (Northeastern University, China), Zhenyu Wen (Zhejiang University of Technology), Zhiwei Zhang (Beijing Institute of Technology), Guoren Wang (Beijing Institute of Technology) Many real-life applications require processing graph data across heterogeneous sources. In this paper, we define graph federation, in which graph data sources are temporarily federated and offer their data to users. Next, we propose a new framework, FedGraph, to efficiently and effectively perform subgraph matching, a crucial application in graph federation. FedGraph consists of three phases, including query decomposition, distributed matching, and distributed joining. We also develop new efficient approximation algorithms and apply them in each phase to attack the NP-hard problem. The evaluations are conducted on a real test bed using both real-life and synthetic graph datasets. FedGraph outperforms the state-of-the-art methods, reducing the execution time and communication cost by 37.3x and 61.8x, respectively.
ByteGraph: A High Performance Distributed Graph Database in ByteDance [Industry] [Download Paper] Changji Li (CUHK)*, Hongzhi CHEN (ByteDance), Shuai Zhang (Bytedance), Yingqian HU (ByteDance), Chao Chen (ByteDance), Zhenjie ZHANG (ByteDance), Meng LI (ByteDance), Xiangchen Li (ByteDance), Dongqing Han (ByteDance), Xiaohui Chen (Bytedance Ltd), Xudong Wang (bytedance), Huiming Zhu (ByteDance), Xuwei FU (bytedance), Tingwei Wu (ByteDance), Hongfei Tan (ByteDance), Hengtian Ding (ByteDance), Mengjin Liu (ByteDance), Kangcheng WANG (ByteDance), Ting Ye (ByteDance), Lei LI (ByteDance), Xin Li (ByteDance), Yu Wang (ByteDance), Chenguang Zheng (CUHK), Hao Yang (Bytedance.com), James Cheng (CUHK) Most products at ByteDance, e.g., TikTok, Douyin, and Toutiao, naturally generate massive amounts of graph data. Efficiently storing, querying, and updating massive graph data is challenging for the broad range of products at ByteDance, which have various performance requirements. We categorize graph workloads at ByteDance into three types: online analytical, transaction, and serving processing, where each workload has its own characteristics. Existing graph databases have different performance bottlenecks in handling these workloads and none can efficiently handle the scale of graphs at ByteDance. We developed ByteGraph to process these graph workloads with high throughput, low latency and high scalability. There are several key designs in ByteGraph that make it efficient for processing our workloads, including edge-trees to store adjacency lists for high parallelism and low memory usage, adaptive optimizations on thread pools and indexes, and geographic replication to achieve fault tolerance and availability. ByteGraph has been in production use for several years and its performance has proven robust for processing a wide range of graph workloads at ByteDance.
Witan: Unsupervised Labelling Function Generation for Assisted Data Programming [Download Paper] Benjamin Denham (Auckland University of Technology)*, Edmund M K Lai (AUT, NZ), Roopak Sinha (AUT), M. Asif Naeem (National University of Computer & Emerging Sciences) Effective supervised training of modern machine learning models often requires large labelled training datasets, which could be prohibitively costly to acquire for many practical applications. Research addressing this problem has sought ways to leverage weak supervision sources, such as the user-defined heuristic labelling functions used in the data programming paradigm, which are cheaper and easier to acquire. Automatic generation of these functions can make data programming even more efficient and effective. However, existing approaches rely on initial supervision in the form of small labelled datasets or interactive user feedback. In this paper, we propose Witan, an algorithm for generating labelling functions without any initial supervision. This flexibility affords many interaction modes, including unsupervised dataset exploration before the user even defines a set of classes. Experiments in binary and multi-class classification demonstrate the efficiency and classification accuracy of Witan compared to alternative labelling approaches.
Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [Download Paper] Di Jin (University of Michigan)*, Bunyamin Sisman (Amazon, USA), Hao Wei (Amazon, USA), Xin Luna Dong (Meta), Danai Koutra (U Michigan) Multi-source entity linkage focuses on integrating knowledge from multiple sources by linking the records that represent the same real-world entity. This is critical in high-impact applications such as data cleaning and user stitching. The state-of-the-art entity linkage pipelines mainly depend on supervised learning that requires abundant training data. However, collecting well-labeled training data becomes expensive when the data from many sources arrives incrementally over time. Moreover, the trained models can easily overfit to specific data sources, and thus fail to generalize to new sources due to significant differences in data and label distributions. To address these challenges, we present AdaMEL, a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage. AdaMEL models the attribute importance that is used to match entities through an attribute-level self-attention mechanism, and leverages the massive unlabeled data from new data sources through domain adaptation to make it generic and data-source agnostic. In addition, AdaMEL is capable of incorporating an additional set of labeled data to more accurately integrate data sources with different attribute importance. Extensive experiments show that our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning. Moreover, it is more stable in handling different sets of data sources and requires less runtime.
PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication [Download Paper] Yue Wang (Microsoft Research)*, Vivek Narasayya (Microsoft), Yeye He (Microsoft Research), Surajit Chaudhuri (Microsoft) The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in numerous real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based Distributed Agglomerative Hierarchical Clustering (PACk) algorithm using a novel distance-based partitioning algorithm and a novel distance-aware merging algorithm. We develop an efficient implementation on Spark that can cluster over 250 million records in 40 minutes using only 16 commodity VMs. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2x to 19x (median 9x) speedup across a variety of synthetic and real-world datasets.
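PACk's distance-based partitioning and distance-aware merging are not spelled out in the abstract; the sketch below only illustrates the general idea of partitioning records around pivots so that hierarchical clustering can proceed per partition. The pivot selection and data are illustrative assumptions.

```python
import numpy as np

def partition_by_nearest_pivot(points: np.ndarray, num_partitions: int, seed: int = 0) -> np.ndarray:
    """Assign each record to its nearest randomly chosen pivot; each partition can
    then be clustered (e.g., agglomeratively) on a separate worker."""
    rng = np.random.default_rng(seed)
    pivots = points[rng.choice(len(points), size=num_partitions, replace=False)]
    dists = np.linalg.norm(points[:, None, :] - pivots[None, :, :], axis=2)
    return dists.argmin(axis=1)

pts = np.random.default_rng(1).normal(size=(1000, 8))
labels = partition_by_nearest_pivot(pts, num_partitions=4)
print(np.bincount(labels))   # partition sizes
```

In PACk itself the partitioning and merging are designed so that clusters split across partitions are still merged correctly; the sketch above makes no such guarantee.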
DQDF: Data-Quality-Aware Dataframes [Download Paper] [Scalable Data Science] Phanwadee Sinthong (University of California, Irvine)*, Dhaval Patel (IBM Research), Nianjun Zhou (IBM Research), Shrey Shrivastava (IBM Research), Arun Iyengar (IBM T.J. Watson Research Center), Anuradha Bhamidipaty (IBM Watson Research Center) Data quality assessment is an essential step in any data analysis process, including machine learning. The process is time-consuming as it involves multiple independent data quality checks that are performed iteratively at scale on evolving data resulting from exploratory data analysis (EDA). Existing solutions that provide computational optimizations for data quality assessment often separate the data structure from its data quality, which then requires effort from users to explicitly maintain state-like information. They also demand a certain level of distributed-systems knowledge from data analysts, who should instead be focusing on analyzing the data, to ensure high-level pipeline optimizations. We therefore propose data-quality-aware dataframes, a data quality management system embedded in a data analyst's familiar data structure, such as a Python dataframe. The framework automatically detects changes in datasets' metadata and exploits the context of each quality check to provide efficient data quality assessment on ever-changing data. We demonstrate in our experiments that our approach can reduce the overall data quality evaluation runtime by 40-80% in both local and distributed setups with less than a 10% increase in memory usage.
Computing How-Provenance for SPARQL Queries via Query Rewriting [Download Paper] Daniel Hernández (Aalborg University), Luis Galárraga (INRIA)*, Katja Hose (Aalborg University) Over the past few years, we have witnessed the emergence of large knowledge graphs built by extracting and combining information from multiple sources. This has propelled many advances in query processing over knowledge graphs; however, the aspect of providing provenance explanations for query results has so far been mostly neglected. We therefore propose a novel method, SPARQLprov, based on query rewriting, to compute how-provenance polynomials for SPARQL queries over knowledge graphs. Contrary to existing works, SPARQLprov is system-agnostic and can be applied to standard and already deployed SPARQL engines without the need for customized extensions. We rely on spm-semirings to compute polynomial annotations that respect the property of commutation with homomorphisms on monotonic and non-monotonic SPARQL queries without aggregate functions. Our evaluation on real and synthetic data shows that SPARQLprov over standard engines incurs an acceptable runtime overhead w.r.t. the original query, competing with state-of-the-art solutions for how-provenance computation.
Fast Detection of Denial Constraint Violations [Download Paper] Eduardo H. M. Pena (UTFPR)*, Eduardo Cunha De Almeida (UFPR), Felix Naumann (Hasso Plattner Institute) The detection of constraint-based errors is a critical task in many data cleaning solutions. Previous works perform the task either using traditional data management systems or using specialized systems that speed up error detection. Unfortunately, both approaches may fail to execute in a reasonable time or even exhaust the available memory in the attempt. To address the main drawbacks of previous approaches, we present the FAst Constraint-based Error DeTector (FACET) to detect violations of denial constraints (DCs). FACET uses column sketch information to organize a pipeline of special operators for DC predicates and it implements these operators using a set of efficient algorithms and data structures that adapt to different data characteristics and predicate structures. We evaluate our system on a diverse array of datasets and constraints, showing its robustness and performance gains compared to different types of DBMSs and to a specialized system.
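To make concrete what a denial-constraint violation is, here is a naive quadratic checker for one example constraint ("no tuple may have a higher salary but lower tax than another"); the column names and data are illustrative assumptions, and this kind of brute-force scan is exactly what FACET's sketch-based operators are designed to avoid.

```python
import pandas as pd

def dc_violations(df: pd.DataFrame):
    """Naive O(n^2) check of the denial constraint:
    there must be no tuples t, s with t.salary > s.salary and t.tax < s.tax."""
    rows = df.to_dict("records")
    out = []
    for i, t in enumerate(rows):
        for j, s in enumerate(rows):
            if i != j and t["salary"] > s["salary"] and t["tax"] < s["tax"]:
                out.append((i, j))
    return out

df = pd.DataFrame({"salary": [50, 60, 70], "tax": [5, 9, 6]})
print(dc_violations(df))   # [(2, 1)]: row 2 earns more than row 1 but pays less tax
```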
Detecting Layout Templates in Complex Multiregion Files [Download Paper] Gerardo Vitagliano (Hasso Plattner Institute)*, Lan Jiang (Hasso Plattner Institute), Felix Naumann (Hasso Plattner Institute) Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach achieves the best performance in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.
Entity Resolution On-Demand [Download Paper] Giovanni Simonini (University of Modena and Reggio Emilia)*, Luca Zecchini (Università degli Studi di Modena e Reggio Emilia), Sonia Bergamaschi (Università di Modena e Reggio Emilia), Felix Naumann (Hasso Plattner Institute) Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner, a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
Data Station: Delegated, Trustworthy, and Auditable Computation to Enable Data-Sharing Consortia with a Data Escrow [Download Paper] Siyuan Xia (University of Chicago)*, Zhiru Zhu (University of Chicago), Christopher Zhu (The University of Chicago), Jinjin Zhao (University of Chicago), Kyle Chard (Computation Institute), Aaron J Elmore (University of Chicago), Ian Foster (University of Chicago & Argonne Nat Lab), Michael Franklin (University of Chicago), Sanjay Krishnan (U Chicago), Raul Castro Fernandez (UChicago) Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations. We introduce Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. Data Station leverages hardware enclaves to generate trust among participants, and exploits the centralization of data and computation to generate an audit log. We evaluate Data Station on machine learning and data-sharing applications while running on an untrusted intermediary. In addition to important qualitative advantages, we show: i) Data Station outperforms federated learning baselines in accuracy and runtime for the machine learning application; ii) it is orders of magnitude faster than alternative secure data-sharing frameworks; iii) it introduces small overhead on the critical path.
Efficient and Effective Data Imputation with Influence Functions [Download Paper] [Scalable Data Science] Xiaoye Miao (Zhejiang University)*, Yangyang Wu (Zhejiang University), Lu Chen (Zhejiang University), Yunjun Gao (Zhejiang University), Jun Wang (The Hong Kong University of Science and Technology), Jianwei Yin (Zhejiang University) Data imputation has been extensively explored to solve the missing data problem. The dramatically rising volume of missing data makes the training of imputation models computationally infeasible in real-life scenarios. In this paper, we propose an efficient and effective data imputation system with influence functions, named EDIT, which quickly trains a parametric imputation model with representative samples under imputation accuracy guarantees. EDIT mainly consists of two modules, i.e., an imputation influence evaluation (IIE) module and a representative sample selection (RSS) module. IIE leverages the influence functions to estimate the effect of (in)complete samples on the prediction result of parametric imputation models. RSS builds a minimum set of the high-effect samples to satisfy a user-specified imputation accuracy. Moreover, we introduce a weighted loss function that drives the parametric imputation model to pay more attention to the high-effect samples. Extensive experiments against ten state-of-the-art imputation methods demonstrate that EDIT uses only about 5% of the samples to speed up model training by 4x on average with more than 11% accuracy gain.
A New Distributional Treatment for Time Series and An Anomaly Detection Investigation [Download Paper] Kai Ming Ting (Nanjing University), Zongyou Liu (Nanjing University)*, Hang Zhang (Nanjing University), Ye Zhu (Deakin University) Time series is traditionally treated with two main approaches, i.e., the time domain approach and the frequency domain approach. These approaches must rely on a sliding window so that time-shifted versions of a periodic subsequence can be measured to be similar. Coupled with the use of a root point-to-point measure, existing methods often have quadratic time complexity. We offer a third approach: the R domain approach. It begins with an insight that subsequences in a periodic time series can be treated as sets of independent and identically distributed (iid) points generated from an unknown distribution in R. This R domain treatment enables two new possibilities: (a) the similarity between two subsequences can be computed using a distributional measure such as Wasserstein distance (WD), kernel mean embedding or Isolation Distributional kernel (IDK); and (b) these distributional measures are not sliding-window-based. Together, they offer an alternative that has more effective similarity measurements and runs significantly faster than the point-to-point and sliding-window-based measures. Our empirical evaluation shows that IDK and WD are effective distributional measures for time series, and IDK-based detectors have better detection accuracy than existing sliding-window-based detectors, and they run faster with linear time complexity.
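As a flavor of the distributional treatment (not the paper's IDK measure), the sketch below compares two subsequences as bags of values using the one-dimensional Wasserstein distance from SciPy; the series and window length are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def subsequence_distance(series: np.ndarray, i: int, j: int, w: int) -> float:
    """Treat two length-w subsequences as iid samples and compare their
    empirical distributions; no alignment or sliding window is needed."""
    return wasserstein_distance(series[i:i + w], series[j:j + w])

t = np.linspace(0, 20 * np.pi, 2000)
sine = np.sin(t)
print(subsequence_distance(sine, 0, 100, 200))        # time-shifted periodic windows: small distance
shifted = sine + (t > 10 * np.pi) * 3.0               # inject a level-shift anomaly
print(subsequence_distance(shifted, 0, 1500, 200))    # anomalous window: large distance
```

Because the measure only sees the set of values, time-shifted copies of a periodic pattern score as similar without any window sliding, which is the key property the paper exploits.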
Enabling Efficient and General Subpopulation Analytics In Multidimensional Data Streams [Download Paper] Antonis Manousis (Carnegie Mellon University)*, Zhuo Cheng (Peking University), Zaoxing Liu (Boston University), Ran Ben Basat (UCL), Vyas Sekar (Carnegie Mellon University) Many large-scale services and infrastructures (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics (e.g., cardinality, entropy, frequency moments, norms) across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real-time at reasonable operational cost. The root cause is the combinatorial explosion of data subpopulations coupled with the diversity of summary statistics we need to simultaneously monitor. In this work, we present Hydra, an efficient framework for multidimensional analytics that builds on two key ideas. First, it avoids the overhead of monitoring exponentially-many subpopulations with a "sketch of sketches" that summarizes data streams with space complexity sub-linear in the number of data subpopulations. Second, Hydra leverages universal sketching to ensure high-fidelity estimations for a broad set of statistics, thus making the time/space complexity independent of the number of different summary statistics. We implement a prototype of Hydra as an Apache Spark plugin and evaluate it on both real-world and synthetic multidimensional datasets. We also tackle practical system challenges to ensure low overheads and large scale. We show that Hydra can achieve robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing analytics engines (e.g., Spark, Druid) while ensuring interactive estimation times.
SCAR - Spectral Clustering Accelerated and Robustified [Download Paper] Ellen Hohma (Technical University of Munich), Christian M.m. Frey (Christian-Albrechts-University Kiel), Anna Beer (LMU Munich)*, Thomas Seidl (LMU Munich) Spectral clustering is one of the most advantageous clustering approaches. However, standard Spectral Clustering is sensitive to noisy input data and has a high runtime complexity. Tackling one of these problems often exacerbates the other. As real-world datasets are often large and compromised by noise, we need to improve both robustness and runtime at once. Thus, we propose Spectral Clustering - Accelerated and Robust (SCAR), an accelerated, robustified spectral clustering method. In an iterative approach, we achieve robustness by separating the data into two latent components: cleansed and noisy data. We accelerate the eigendecomposition, the most time-consuming step, based on the Nyström method. We compare SCAR to related recent state-of-the-art algorithms in extensive experiments. SCAR surpasses its competitors in terms of speed and clustering quality on highly noisy data.
On Detecting Cherry-picked Generalizations [Download Paper] Yin Lin (University of Michigan)*, Brit Youngmann (MIT), Yuval Moskovitch (University of Michigan), H. V. Jagadish (University of Michigan), Tova Milo (Tel Aviv University) Generalizing from detailed data to statements in a broader context is often critical for users to make sense of large data sets. Correspondingly, poorly constructed generalizations might convey misleading information even if the statements are technically supported by the data. For example, a cherry-picked level of aggregation could obscure substantial sub-groups that oppose the generalization. We present a framework for detecting and explaining cherry-picked generalizations by refining aggregate queries. We present a scoring method to indicate the appropriateness of the generalizations. We design efficient algorithms for score computation. For providing a better understanding of the resulting score, we also formulate practical explanation tasks to disclose significant counterexamples and provide better alternatives to the statement. We conduct experiments using real-world data sets and examples to show the effectiveness of our proposed evaluation metric and the efficiency of our algorithmic framework.
BABOONS: Black-Box Optimization of Data Summaries in Natural Language [Download Paper] Immanuel Trummer (Cornell)* BABOONS (BlAck BOx OptimizatioN of data Summaries) is a system that automatically optimizes text data summaries for an arbitrary, user-defined utility function. Data summaries use relational data to compare user-defined items to others in terms of aggregate values for data subsets. For instance, BABOONS supports text evaluation by user-provided models for text analysis. BABOONS uses reinforcement learning to explore the space of possible descriptions. In each iteration, BABOONS generates summaries and evaluates their utility. To reduce data processing overheads during summary generation, BABOONS uses a proactive processing strategy that dynamically merges current with likely future queries for efficient processing. Also, BABOONS supports scenario-specific sampling and batch processing strategies. These mechanisms allow processing to scale to large data and item sets. The experiments show that BABOONS scales significantly better than baselines. Also, they show that summaries generated by BABOONS receive higher average grades from users in a large survey.
Multivariate Correlations Discovery in Static and Streaming Data [Download Paper] Koen Minartz (Eindhoven University of Technology), Jens D'hondt (TU Eindhoven), Odysseas Papapetrou (TU Eindhoven)* Correlation analysis is an invaluable tool in many domains, for better understanding data and extracting salient insights. Most works to date focus on detecting high pairwise correlations. A generalization of this problem with known applications but no known efficient solutions involves the discovery of strong multivariate correlations, i.e., finding vectors (typically in the order of 3 to 5 vectors) that exhibit a strong dependence when considered altogether. In this work we propose algorithms for detecting multivariate correlations in static and streaming data. Our algorithms, which rely on novel theoretical results, support two different correlation measures, and allow for additional constraints. Our extensive experimental evaluation examines the properties of our solution and demonstrates that our algorithms outperform the state-of-the-art, typically by an order of magnitude.
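The paper's correlation measures and pruning theory are not reproduced here; as background only, the sketch below computes the multiple correlation coefficient of one vector against a small set of others, i.e., the kind of 3-to-5-way dependence the abstract refers to. The data is synthetic and illustrative.

```python
import numpy as np

def multiple_correlation(target: np.ndarray, others: np.ndarray) -> float:
    """R in [0, 1]: how well target is explained by a linear combination of the
    columns of `others` (plus an intercept)."""
    X = np.column_stack([others, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return float(np.sqrt(max(0.0, 1.0 - ss_res / ss_tot)))

rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)
c = 0.6 * a - 0.4 * b + 0.05 * rng.normal(size=1000)     # c depends on {a, b} jointly
print(multiple_correlation(c, np.column_stack([a, b])))  # close to 1
print(abs(np.corrcoef(c, a)[0, 1]))                      # any single pairwise correlation is weaker
```

The hard part, which the paper addresses, is finding such vector combinations among huge candidate sets without testing them all.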
Lux: Always-on Visualization Recommendations for Exploratory Dataframe Workflows [Download Paper] Doris Lee (UC Berkeley)*, Dixin Tang (University of California, Berkeley), Kunal Agarwal (University of California, Berkeley), Thyne Boonmark (UC Berkeley), Caitlyn Chen (University of California, Berkeley), Jake Kang (UC Berkeley), Ujjaini Mukhopadhyay (UC Berkeley), Jerry Song (University of California, Berkeley), Micah Yong (UC Berkeley), Marti A. Hearst (), Aditya Parameswaran (University of California, Berkeley) Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose LUX, an always-on framework for accelerating visual insight discovery in dataframe workflows. When users print a dataframe in their notebooks, LUX recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. LUX features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that through the use of a careful design and three system optimizations, LUX adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate LUX in terms of usability via interviews with early adopters, finding that LUX helps fulfill the needs of data scientists for visualization support within their dataframe workflows. LUX has already been embraced by data science practitioners, with over 3.1k stars on Github.
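A minimal usage sketch of the open-source release (assuming the lux-api package name and a placeholder CSV file; the column names are illustrative):

```python
# pip install lux-api
import pandas as pd
import lux  # noqa: F401 -- importing lux augments pandas dataframes with recommendations

df = pd.read_csv("college.csv")            # any tabular dataset
df.intent = ["AverageCost", "SATAverage"]  # optionally steer the recommendations
df   # in a Jupyter notebook, printing the dataframe shows Lux's recommended visualizations
```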
Distributed Learning of Fully Connected Neural Networks using Independent Subnet Training [Download Paper] [Scalable Data Science] Binhang Yuan (Rice University), Cameron Wolfe (Rice University)*, Chen Dun (Rice University), Yuxin Tang (Rice University ), Anastasios Kyrillidis (Rice University ), Chris Jermaine (Rice University) Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally "model parallel" approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.
Fast Network K-function-based Spatial Analysis [Download Paper] Tsz Nam Chan (Hong Kong Baptist University)*, Leong Hou U (University of Macau), Yun Peng (Guangzhou University), Byron Choi (Hong Kong Baptist University), Jianliang Xu (Hong Kong Baptist University) The network $K$-function has been the de facto operation for analyzing point patterns in spatial networks, and is widely used in many communities, including geography, ecology, transportation science, social science, and criminology. To analyze a location dataset, domain experts need to generate a network $K$-function plot that involves computing multiple network $K$-functions. However, the network $K$-function is a computationally expensive operation that is infeasible for large-scale datasets, let alone for generating a network $K$-function plot. To handle this issue, we develop two efficient algorithms, namely count augmentation (CA) and neighbor sharing (NS), which reduce the worst-case time complexity for computing network $K$-functions. In addition, we incorporate the advanced shortest path sharing (ASPS) approach into these two methods to further lower the worst-case time complexity for generating network $K$-function plots. Experimental results on four large-scale location datasets (up to 7.33 million data points) show that our methods can achieve up to 167.52x speedup compared with the state-of-the-art methods.
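For readers unfamiliar with the operation being accelerated, here is a naive network K-function on a toy NetworkX graph with unit edge lengths; the particular normalization and the data are illustrative assumptions, and the repeated shortest-path work in this sketch is exactly the cost the paper's CA/NS/ASPS techniques attack.

```python
import networkx as nx

def network_k_function(G: nx.Graph, points, r: float, weight: str = "length") -> float:
    """Naive K(r): average number of other points within network distance r,
    normalized by the point intensity over the total network length."""
    intensity = len(points) / G.size(weight=weight)
    count = 0
    for p in points:
        dist = nx.single_source_dijkstra_path_length(G, p, cutoff=r, weight=weight)
        count += sum(1 for q in points if q != p and q in dist)
    return count / (len(points) * intensity)

G = nx.grid_2d_graph(20, 20)
nx.set_edge_attributes(G, 1.0, "length")
pts = [(i, i) for i in range(0, 20, 2)]
print(network_k_function(G, pts, r=6.0))
```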
Accelerating Recommendation System Training by Leveraging Popular Choices [Download Paper] Muhammad Adnan (University of British Columbia), Yassaman Ebrahimzadeh Maboud (University of British Columbia), Divya Mahajan (Microsoft)*, Prashant J. Nair (University of British Columbia ) Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline accuracy.
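The sketch below is not FAE, but it illustrates the hot/cold split the abstract describes: rank embedding rows by access frequency, keep the small hot slice on the accelerator, and leave the long tail in host memory. The Zipfian access counts, table sizes and 1% threshold are illustrative assumptions.

```python
import numpy as np
import torch

def split_hot_cold(access_counts: np.ndarray, hot_fraction: float = 0.01):
    """Return ids of the most frequently accessed ('hot') embedding rows and the rest."""
    n_hot = max(1, int(len(access_counts) * hot_fraction))
    order = np.argsort(-access_counts)
    return order[:n_hot], order[n_hot:]

rng = np.random.default_rng(0)
counts = rng.zipf(a=1.3, size=1_000_000).astype(np.int64)   # skewed toy access pattern
hot_ids, cold_ids = split_hot_cold(counts)

dim = 64
hot_table = torch.nn.Embedding(len(hot_ids), dim)    # small; can live in GPU memory
cold_table = torch.nn.Embedding(len(cold_ids), dim)  # large; stays in CPU memory
if torch.cuda.is_available():
    hot_table = hot_table.cuda()
print(f"hot rows: {len(hot_ids)}, share of accesses: {counts[hot_ids].sum() / counts.sum():.1%}")
```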
Cardinality Estimation of Approximate Substring Queries using Deep Learning [Download Paper] Suyong Kwon (Seoul National University), Woohwan Jung (Hanyang University)*, Kyuseok Shim (Seoul National University) Cardinality estimation of an approximate substring query is an important problem in database systems. Traditional approaches build a summary from the text data and estimate the cardinality using the summary with some statistical assumptions. Since deep learning models can learn underlying complex data patterns effectively, they have been successfully applied and shown to outperform traditional methods for cardinality estimation of queries in database systems. However, since they have not yet been applied to approximate substring queries, we investigate a deep learning approach for cardinality estimation of such queries. Although the accuracy of deep learning models tends to improve as the training data size increases, producing large training data is computationally expensive for cardinality estimation of approximate substring queries. Thus, we develop efficient training data generation algorithms by avoiding unnecessary computations and sharing common computations. We also propose a deep learning model as well as a novel learning method to quickly obtain an accurate deep learning based estimator. Extensive experiments confirm the superiority of our data generation algorithms and deep learning model with the novel learning method.
SCARA: Scalable Graph Neural Networks with Feature-Oriented Optimization [Download Paper] [Scalable Data Science] Ningyi Liao (Nanyang Technological University)*, Dingheng Mo (Nanyang Technological University), Siqiang Luo (Nanyang Technological University), Xiang Li (East China Normal University), Pengcheng Yin (Carnegie Mellon University) Recent advances in data processing have stimulated the demand for learning graphs of very large scales. Graph Neural Networks (GNNs), being an emerging and powerful approach in solving graph learning tasks, are known to be difficult to scale up. Most scalable models apply node-based techniques in simplifying the expensive graph message-passing propagation procedure of GNNs. However, we find such acceleration insufficient when applied to million- or even billion-scale graphs. In this work, we propose SCARA, a scalable GNN with feature-oriented optimization for graph computation. SCARA efficiently computes graph embeddings from node features, and further selects and reuses feature computation results to reduce overhead. Theoretical analysis indicates that our model achieves sub-linear time complexity with a guaranteed precision in the propagation process as well as GNN training and inference. We conduct extensive experiments on various datasets to evaluate the efficacy and efficiency of SCARA. Performance comparison with baselines shows that SCARA can achieve up to 100x graph propagation acceleration over current state-of-the-art methods with fast convergence and comparable accuracy. Most notably, it can efficiently complete precomputation on the largest available billion-scale GNN dataset, Papers100M (111M nodes, 1.6B edges), in 100 seconds.
Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices [Download Paper] Paolo Ferragina (Università di Pisa), Giovanni Manzini (University of Pisa), Travis Gagie (Dalhousie University), Dominik Köppl (TMDU), Gonzalo Navarro (University of Chile), Manuel Striani (University of Piemonte Orientale), Francesco Tosoni (Università di Pisa)* As Machine Learning (ML) techniques nowadays generate huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz, which require full decompression of the compressed matrix. To our knowledge, our lossless compressor is the first to achieve time and space complexities that match the theoretical limit expressed by the k-th order statistical entropy of the input. To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various datasets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication. Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach, showing that ours always runs at least twice as fast (in a multi-threaded setting) and achieves better compressed space occupancy for most of the tested datasets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach.
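Grammar compression is more general than this, but the following sketch shows the simplest instance of the idea that makes compressed matrix-vector multiplication pay off: when columns repeat, their contributions can be aggregated on the vector side first, so each distinct column is touched only once. The dictionary matrix and repetition pattern are illustrative assumptions.

```python
import numpy as np

def compressed_matvec(unique_cols: np.ndarray, col_id: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = A @ x where A's (possibly repeated) columns are stored once in
    unique_cols and referenced through col_id."""
    weights = np.zeros(unique_cols.shape[1])
    np.add.at(weights, col_id, x)      # sum the x-entries that hit identical columns
    return unique_cols @ weights

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(500, 8))      # 8 distinct columns
ids = rng.integers(0, 8, size=10_000)       # a 500 x 10000 matrix with many repeated columns
x = rng.normal(size=10_000)
dense = dictionary[:, ids]                  # materialized only to verify the result
print(np.allclose(dense @ x, compressed_matvec(dictionary, ids, x)))
```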
PerfGuard: Deploying ML-for-Systems without Performance Regressions, Almost! [Download Paper] H M Sajjad Hossain (Microsoft)*, Marc T Friedman (Microsoft), Hiren Patel (Microsoft), Shi Qiao (Microsoft), Soundar Srinivasan (Microsoft), Markus Weimer (Microsoft), Remmelt Ammerlaan (Microsoft), Lucas Rosenblatt (NYU), Gilbert Antonius (Microsoft), Peter Orenberg (Microsoft), Vijay Ramani (Microsoft), Abhishek Roy (Microsoft), Irene Shaffer (Microsoft), Alekh Jindal (Keebo) Modern cloud workloads require tuning and optimization at massive scales, and automated optimizations using machine learning models (ML-for-Systems) have shown promising results. The machine learning models, however, are subject to over-generalization: they do not capture the large variety of workload patterns, and tend to improve the performance of certain subsets of the workload while regressing performance for others. In this paper, we introduce a performance safeguard system (PerfGuard) that assists in designing pre-production experiments to inform model deployment. Our experimentation pipeline circumvents searching the entire query plan space (a well-known, intractable problem), and instead focuses on plan structure deltas (a significantly smaller space). Our ML approach formalizes these differences, and correlates plan deltas to important feedback signals, like execution cost. We share our end-to-end pipeline structure and deep learning architecture as a prototype system for use with general relational databases. We demonstrate that this architecture improves on baseline models, and that our pipeline identifies key query plan components as major contributors to plan disparity. In offline experimentation, focusing on plan changes proves to be a promising approach, with many opportunities for future improvement.
TSCache: An Efficient Flash-based Caching Scheme for Time-series Data Workloads [Download Paper] Jian Liu (Louisiana State University)*, Kefei Wang (Louisiana State University), Feng Chen (Louisiana State University) Time-series databases are becoming an indispensable component in today's data centers. In order to manage the rapidly growing time-series data, we need an effective and efficient system solution to handle the huge traffic of time-series data queries. A promising solution is to deploy a high-speed, large-capacity cache system to relieve the burden on the backend time-series databases and accelerate query processing. However, time-series data is drastically different from other traditional data workloads, bringing both challenges and opportunities. In this paper, we present a flash-based cache system design for time-series data, called TSCache. By exploiting the unique properties of time-series data, we have developed a set of optimization schemes, such as a slab-based data management, a two-layered data indexing structure, an adaptive time-aware caching policy, and a low-cost compaction process. We have implemented a prototype based on Twitter's Fatcache. Our experimental results show that TSCache can significantly improve client query performance, effectively increasing the bandwidth by a factor of up to 6.7 and reducing the latency by up to 84.2%.
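TSCache's flash layout, indexing and time-aware eviction are not reproduced here; the toy sketch below only shows the slab idea at its simplest: group points into fixed-width time slabs so a range query touches only the slabs it overlaps (eviction here is plain LRU, unlike the paper's adaptive time-aware policy). Slab width and capacity are illustrative assumptions.

```python
import time
from collections import OrderedDict

class TimeSlabCache:
    """Toy in-memory time-series cache with fixed-width time slabs and LRU eviction."""

    def __init__(self, slab_seconds: int = 60, max_slabs: int = 1024):
        self.slab_seconds, self.max_slabs = slab_seconds, max_slabs
        self.slabs: "OrderedDict[int, list]" = OrderedDict()

    def put(self, ts: float, value) -> None:
        key = int(ts // self.slab_seconds)
        self.slabs.setdefault(key, []).append((ts, value))
        self.slabs.move_to_end(key)
        while len(self.slabs) > self.max_slabs:
            self.slabs.popitem(last=False)   # evict the least recently used slab

    def query(self, start: float, end: float) -> list:
        out = []
        for key in range(int(start // self.slab_seconds), int(end // self.slab_seconds) + 1):
            if key in self.slabs:
                self.slabs.move_to_end(key)
                out.extend((ts, v) for ts, v in self.slabs[key] if start <= ts <= end)
        return out

cache = TimeSlabCache(slab_seconds=10)
now = time.time()
for i in range(100):
    cache.put(now + i, i)
print(len(cache.query(now + 20, now + 40)))   # 21 points
```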
Are Updatable Learned Indexes Ready? [Download Paper] Chaichon Wongkham (The Chinese University of Hong Kong)*, Baotong Lu (Chinese University of Hong Kong), Chris Liu (Chinese University of Hong Kong), Zhicong Zhong (Chinese University of Hong Kong), Eric Lo (Chinese University of Hong Kong), Tianzheng Wang (Simon Fraser University) Recently, numerous promising results have shown that updatable learned indexes can perform better than traditional indexes with much lower memory space consumption. But it is unknown how these learned indexes compare against each other and against the traditional ones under realistic workloads with changing data distributions and concurrency levels. This makes practitioners still wary about how these new indexes would actually behave in practice. To fill this gap, this paper conducts the first comprehensive evaluation on updatable learned indexes. Our evaluation uses ten real datasets and various workloads to challenge learned indexes in three aspects: performance, memory space efficiency and robustness. Based on the results, we give a series of takeaways that can guide the future development and deployment of learned indexes.
TAOBench: An End-to-End Benchmark for Social Networking Workloads [Download Paper] Audrey Cheng (UC Berkeley)*, Xiao Shi (Facebook, Inc.), Aaron N Kabcenell (Facebook), Shilpa Lawande (Facebook, Inc.), Hamza Qadeer (University of California, Berkeley), Jason Chan (UC Berkeley), Harrison Tin ( University of California, Berkeley), Ryan Zhao (University of California, Berkeley), Peter Bailis (), Mahesh Balakrishnan (Microsoft Research), Nathan Bronson (Rockset), Natacha Crooks (UC Berkeley), Ion Stoica (UC Berkeley) The continued emergence of large social network applications has introduced a scale of data and query volume that challenges the limits of existing data stores. However, few benchmarks accurately simulate these request patterns, leaving researchers in short supply of tools to evaluate and improve upon these systems. In this paper, we present a new benchmark, TAOBench, that captures the social graph workload at Meta. We open source workload configurations along with a benchmark that leverages these request features to both accurately model production workloads and generate emergent application behavior. We ensure the integrity of TAOBench's workloads by validating them against their production counterparts. We also describe several benchmark use cases at Meta and report results for five popular distributed database systems to demonstrate the benefits of using TAOBench to evaluate system tradeoffs as well as identify and address performance issues. Our benchmark fills a gap in the available tools and data that researchers and developers have to inform system design decisions.
Containerized Execution of UDFs: An Experimental Evaluation [Download Paper] Karla Saur (Microsoft)*, Tara Mirmira (University of California, San Diego), Konstantinos Karanasos (Meta), Jesús Camacho-rodríguez (Microsoft) User-defined functions (UDFs) have long been used as the de facto way to extend the capabilities of data management systems. However, they are restricted to the specificities of each DBMS, and recent demands for advanced analytics have increased the need for complex UDFs that may require execution of arbitrary computation written in any programming language, management of library dependencies, portability across environments and engines, and resource isolation. These requirements go beyond what traditional UDFs were designed for, and have given rise to containerized UDFs that enable encapsulation and portability. However, this approach is nascent and can result in significant performance penalties and usability issues. In this paper, we present the first study that spans all stages of containerized UDFs' life cycle, performance bottlenecks in their execution, and extensibility to support different engines. Our experiments show that the performance of containerized UDF execution can be greatly affected by system design choices and that there are many trade-offs to consider. For example, regarding the method of communication with the containerized UDF, we show that binary-based implementations minimize overheads and are more than 2.4x faster than widely used text-based ones. Adopting a newer general-purpose communication method such as Arrow Flight can improve performance dramatically, causing a minimal ~10% slowdown compared to non-containerized UDFs. Additionally, containerized UDF start times vary wildly due to program size and complexity, from 0.07s to 7s in our experiments. Our insights can help DBMS developers make appropriate choices based on individual use cases when designing their systems.
LlamaTune: Sample-Efficient DBMS Configuration Tuning [Download Paper] Konstantinos Kanellis (University of Wisconsin-Madison)*, Cong Ding (University of Wisconsin-Madison), Brian Kroth (Microsoft), Andreas C Mueller (Microsoft), Carlo Curino (Microsoft), Shivaram Venkataraman (University of Wisconsin, Madison) Tuning a database system to achieve optimal performance on a given workload is a long-standing problem in the database community. A number of recent works have leveraged ML-based approaches to guide the sampling of large parameter spaces (hundreds of tuning knobs) in search of high-performance configurations. Looking at Microsoft production services operating millions of databases, sample efficiency emerged as a crucial requirement to use tuners on diverse workloads. This motivates our investigation of LlamaTune, a tuner design that leverages domain knowledge to improve the sample efficiency of existing optimizers. LlamaTune employs an automated dimensionality reduction technique based on randomized projections, a biased-sampling approach to handle special values for certain knobs, and knob-value bucketization, to reduce the size of the search space. LlamaTune compares favorably with the state-of-the-art optimizers across a diverse set of workloads. It identifies the best-performing configurations with up to 11x fewer workload runs, while reaching up to 21% higher throughput. We also show that the benefits of LlamaTune generalize across both BO-based and RL-based optimizers, as well as different DBMS versions. While the journey to perform database tuning at cloud scale remains long, LlamaTune goes a long way in making automatic DBMS tuning practical at scale.
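LlamaTune's exact projection construction is not given in the abstract; the sketch below only conveys the randomized-projection idea it builds on: the optimizer proposes points in a low-dimensional box, and a fixed random matrix maps each point to a full configuration over hundreds of knobs. The dimensions, scaling and clipping are illustrative assumptions.

```python
import numpy as np

def make_projection(num_knobs: int, low_dim: int = 16, seed: int = 0):
    """Return a function mapping a low-dimensional search point z in [-1, 1]^low_dim
    to normalized knob values in [0, 1]^num_knobs via a fixed random projection."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(num_knobs, low_dim))

    def to_config(z: np.ndarray) -> np.ndarray:
        return np.clip((P @ z + 1.0) / 2.0, 0.0, 1.0)

    return to_config

to_config = make_projection(num_knobs=200, low_dim=16)
z = np.random.default_rng(1).uniform(-1, 1, size=16)   # what a BO/RL tuner would propose
config = to_config(z)                                   # 200 normalized knob settings
print(config.shape, round(float(config.min()), 3), round(float(config.max()), 3))
```

The tuner then only has to explore 16 dimensions instead of 200, which is where the sample-efficiency gain comes from.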
Discovering Association Rules from Big Graphs [Download Paper] Wenfei Fan (Univ. of Edinburgh ), Wenzhi Fu (University of Edinburgh), Ruochun Jin (National University of Defense Technology), Ping Lu (Beihang Univ.), Chao Tian (Chinese Academy of Sciences)* This paper tackles two challenges to the discovery of graph rules. Existing discovery methods often (a) return an excessive number of rules, and (b) do not scale with large graphs given the intractability of the discovery problem. We propose an application-driven strategy to cut back rules and data that are irrelevant to users' interests, by training a machine learning (ML) model to identify data pertaining to a given application. Moreover, we introduce a sampling method to reduce a big graph G to a set H of small sample graphs. Given expected support and recall bounds, the method is able to deduce samples in H and mine rules from H that satisfy the bounds in the entire G. As proof of concept, we develop an algorithm to discover Graph Association Rules (GARs), which combine graph patterns and attribute dependencies, and may embed ML classifiers as predicates. We show that the algorithm is parallelly scalable, i.e., it guarantees to reduce runtime when more machines are used. We experimentally verify that the method is able to discover rules with recall above 91% when using a sample ratio of 10%, with a speedup of 61 times.
Towards Event Prediction in Temporal Graphs [Download Paper] Wenfei Fan (Univ. of Edinburgh ), Ruochun Jin (National University of Defense Technology), Ping Lu (Beihang Univ.), Chao Tian (Chinese Academy of Sciences)*, Ruiqi Xu (National University of Singapore) This paper proposes a class of temporal association rules, denoted by TACOs, for event prediction. As opposed to previous graph rules, TACOs monitor updates to graphs, and can be used to capture temporal interests in recommendation and catch fraud in response to behavior changes, among other things. TACOs are defined on temporal graphs in terms of change patterns and (temporal) conditions, and may carry machine learning (ML) predicates for temporal event prediction. We settle the complexity of reasoning about TACOs, including their satisfiability, implication and prediction problems. We develop a system, referred to as TASTE. TASTE discovers TACOs by iteratively training a rule creator based on generative ML models in a creator-critic framework. Moreover, it predicts events by applying the discovered TACOs. Using real-life and synthetic datasets, we experimentally verify that TASTE is on average 31.4 times faster than conventional data mining methods in TACO discovery, and it improves the accuracy of state-of-the-art event prediction models by 23.4%.
Towards Distributed Bitruss Decomposition on Bipartite Graphs [Download Paper] Yue Wang (Shenzhen Institute of Computing Sciences)*, Ruiqi Xu (National University of Singapore), Xun Jian (HKUST), Alexander Zhou (Hong Kong University of Science and Technology), Lei Chen (Hong Kong University of Science and Technology) Mining cohesive subgraphs on bipartite graphs is an important task. The k-bitruss is one of the most popular cohesive subgraph models; it is the maximal subgraph where each edge is contained in at least k butterflies. The bitruss decomposition problem is to find all k-bitrusses for k ≥ 0. Dealing with large graphs is often beyond the ability of a single machine due to its limited memory and computational power, leading to a need for efficiently processing large graphs in a distributed environment. However, all current solutions are for a single-machine, centralized environment, where processors can access the graph or auxiliary indexes randomly and globally. It is difficult to directly deploy such algorithms on a shared-nothing model. In this paper, we propose distributed algorithms for bitruss decomposition. We first propose SC-HBD as a baseline, which uses an H-function to define bitruss numbers and computes them iteratively to a fixed point in parallel. We then introduce a subgraph-centric peeling method SC-PBD, which peels edges in batches over different butterfly-complete subgraphs. We then introduce local indexes on each fragment, study the butterfly-aware edge partition problem including its hardness, and propose an effective partitioner. We finally present the concept of a bitruss butterfly-complete subgraph, and propose a divide-and-conquer method DC-BD with optimization strategies. Extensive experiments show that the proposed methods solve graphs with 30 trillion butterflies in 2.5 hours, while existing parallel methods under the shared-memory model fail to scale to such large graphs.
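For readers unfamiliar with the terminology, the sketch below counts, on a toy bipartite graph, how many butterflies (2x2 bicliques) contain each edge; a k-bitruss keeps exactly the edges whose count is at least k. This is a naive single-machine illustration, not the paper's distributed peeling.

```python
from collections import defaultdict
from itertools import combinations

def butterfly_support(edges):
    """For each edge (u, v) of a bipartite graph, count the butterflies containing it."""
    adj = defaultdict(set)              # left vertex -> set of right neighbours
    for u, v in edges:
        adj[u].add(v)
    support = defaultdict(int)
    for u1, u2 in combinations(adj, 2):
        common = adj[u1] & adj[u2]
        for v1, v2 in combinations(sorted(common), 2):   # each pair forms one butterfly
            for e in ((u1, v1), (u1, v2), (u2, v1), (u2, v2)):
                support[e] += 1
    return support

E = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"), ("u2", "v3")]
print(butterfly_support(E)[("u1", "v1")])   # 1: the butterfly {u1, u2} x {v1, v2}
```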
Analysis of Influence Contribution in Social Advertising [Download Paper] Yuqing Zhu (Nanyang technological university ), Jing Tang (The Hong Kong University of Science and Technology)*, Xueyan Tang (Nanyang Technological University), Lei Chen (Hong Kong University of Science and Technology) In today's social advertising models, Online Social Network (OSN) providers usually conduct advertising campaigns by inserting social ads into promoted posts. Whenever a user engages with a promoted ad, the engaging user may further propagate the promoted ad to her followers recursively, and the propagation process is known as the word-of-mouth effect. In order to spread the promotion cascade widely and efficiently, the OSN provider often tends to select influencers, who normally have large audiences over the social network, to initiate the advertising campaign. This marketing model, also termed influencer marketing, has been gaining increasing traction and investment and is rapidly becoming one of the most widely-used channels in digital marketing. In this paper, we formulate the problem for the OSN provider of deriving the influence contributions of influencers given the campaign result, considering the viral propagation of the ads, namely influence contribution allocation (ICA). We make a connection between ICA and the concept of Shapley value in cooperative game theory to reveal the rationale behind ICA. A naive method to obtain the solution to ICA is to enumerate all possible cascades delivering the campaign result, resulting in an exponential number of potential cascades with respect to the number of connections between users, which is computationally intractable. Moreover, generating a cascade producing the exact campaign result is non-trivial. Facing these challenges, we develop an exact solution in linear time under the linear threshold (LT) model, and devise a fully polynomial-time randomized approximation scheme (FPRAS) under the independent cascade (IC) model. Specifically, under the IC model, we propose an efficient approach to estimate the expected influence contribution in probabilistic graphs modeling OSNs by designing a scalable sampling method with provable accuracy guarantees. We conduct extensive experiments and show that our algorithms yield solutions with remarkably higher quality over several baselines and improve the sampling efficiency significantly.
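The paper's exact LT-model solution and IC-model FPRAS are its core contributions and are not reproduced here; the sketch below only illustrates the Shapley-value connection with a generic Monte Carlo permutation estimator over a toy reach function. The influencers, audiences and sample count are illustrative assumptions.

```python
import random
from typing import Callable, Dict, Sequence

def shapley_monte_carlo(players: Sequence[str],
                        value: Callable[[frozenset], float],
                        samples: int = 2000,
                        seed: int = 0) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal contributions over random
    permutations of the players (here: influencers in a campaign)."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(samples):
        rng.shuffle(order)
        coalition, prev = frozenset(), 0.0
        for p in order:
            coalition = coalition | {p}
            cur = value(coalition)
            phi[p] += cur - prev
            prev = cur
    return {p: v / samples for p, v in phi.items()}

# Toy campaign value: total audience reached; audiences overlap, so credit is shared.
audience = {"alice": {1, 2, 3, 4}, "bob": {3, 4, 5}, "carol": {6}}
reach = lambda c: float(len(set().union(*(audience[p] for p in c)))) if c else 0.0
print(shapley_monte_carlo(list(audience), reach))
```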
LargeEA: Aligning Entities for Large-scale Knowledge Graphs [Download Paper] Congcong Ge (Zhejiang University), Xiaoze Liu (Zhejiang University), Lu Chen (Zhejiang University), Baihua Zheng (Singapore Management University), Yunjun Gao (Zhejiang University)* Entity alignment (EA) aims to find equivalent entities in different knowledge graphs (KGs). Current EA approaches suffer from scalability issues, limiting their usage in real-world EA scenarios. To tackle this challenge, we propose LargeEA to align entities between large-scale KGs. LargeEA consists of two channels, i.e., structure channel and name channel. For the structure channel, we present METIS-CPS, a memory-saving mini-batch generation strategy, to partition large KGs into smaller mini-batches. LargeEA, designed as a general tool, can adopt any existing EA approach to learn entities' structural features within each mini-batch independently. For the name channel, we first introduce NFF, a name feature fusion method, to capture rich name features of entities without involving any complex training process; we then exploit a name-based data augmentation to generate seed alignment without any human intervention. Such design fits common real-world scenarios much better, as seed alignment is not always available. Finally, LargeEA derives the EA results by fusing the structural features and name features of entities. Since no widely-acknowledged benchmark is available for large-scale EA evaluation, we also develop a large-scale EA benchmark called DBP1M extracted from real-world KGs. Extensive experiments confirm the superiority of LargeEA against state-of-the-art competitors.
ForBackBench: A Benchmark for Chasing vs. Query-Rewriting [Download Paper] [Experiments, Analyses & Benchmarks] Afnan G Alhazmi (Southampton University)*, Tom Blount (University of Southampton), George Konstantinidis (University of Southampton) The fields of Data Integration/Exchange (DE) and Ontology Based Data Access (OBDA) have been extensively studied across different communities. The underlying problem is common: using a number of differently structured data-sources mapped to a common mediating schema/ontology/knowledge-graph, answer a query posed on the latter. In DE, forward-chaining algorithms, collectively known as the chase, are used to transform source data to a new materialised instance that satisfies the ontology and can be directly used for query answering. In OBDA, backward-chaining algorithms rewrite the query over the source schema, taking the ontology into account, in order to execute the rewriting directly on the source instances. These two families of reasoning approaches have seen an individual rise in algorithms, practical implementations, and benchmarks. However, there has not been a principled methodology to compare solutions across both areas. In this paper we provide an original methodology and a benchmark infrastructure (a set of test scenarios, a set of generator and translator tools, and an experimental infrastructure) to allow the translation and execution of a DE/OBDA scenario across areas and among different chase and query-rewriting systems. In the process, we also present a syntactic restriction of linear Tuple Generating Dependencies that precisely captures DL-Lite_R, a correspondence previously uninvestigated. We perform a series of cross-approach experiments under a wide range of assumptions, such as the use of different source-to-target mapping languages, shedding light on the interplay between forward- and backward-chaining. Our preliminary results show that chase systems can indeed compete with and even outperform query rewriting, even in the face of large data, especially for complex mapping languages.
Generalized Supervised Meta-blocking [Download Paper] Luca Gagliardelli (University of Modena & Reggio Emilia)*, George Papadakis (University of Athens), Giovanni Simonini (University of Modena and Reggio Emilia), Sonia Bergamaschi (Università di Modena e Reggio Emilia), Themis Palpanas (Université Paris Cité) Entity Resolution constitutes a core data integration task that relies on Blocking to scale to large datasets. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons), which can be significantly reduced through Meta-blocking techniques, which leverage the co-occurrence patterns of entities inside the blocks: first, a weighting scheme assigns a score to every pair of candidate entities in proportion to the likelihood that they are matching and then, a pruning algorithm discards the pairs with the lowest scores. Supervised Meta-blocking goes beyond this approach by combining multiple scores per comparison into a feature vector that is fed to a binary classifier. By using probabilistic classifiers, Generalized Supervised Meta-blocking associates every pair of candidates with a score that can be used by any pruning algorithm. For higher effectiveness, new weighting schemes are examined as features. Through extensive experiments, we identify the best pruning algorithms, their optimal sets of features as well as the minimum possible size of the training set.
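To make the pipeline above concrete, here is a rough sketch of block-derived features per candidate pair, a probabilistic (logistic) score, and score-based pruning; the features, coefficients, and threshold are hypothetical and are not taken from the paper:

    # Toy supervised meta-blocking: score candidate pairs from block co-occurrence
    # features with a (pre-trained, here hand-set) logistic model, then prune.
    import math

    blocks = {'b1': ['e1', 'e2', 'e3'], 'b2': ['e1', 'e2'], 'b3': ['e3', 'e4']}
    entity_blocks = {}
    for b, ents in blocks.items():
        for e in ents:
            entity_blocks.setdefault(e, set()).add(b)

    def features(e1, e2):
        b1, b2 = entity_blocks[e1], entity_blocks[e2]
        common = len(b1 & b2)              # common-blocks weight
        jaccard = common / len(b1 | b2)    # Jaccard-style weight
        return (common, jaccard)

    def match_probability(f, w=(1.2, 2.5), bias=-3.0):   # hypothetical coefficients
        z = bias + w[0] * f[0] + w[1] * f[1]
        return 1.0 / (1.0 + math.exp(-z))

    pairs = [('e1', 'e2'), ('e1', 'e3'), ('e3', 'e4')]
    scored = {p: match_probability(features(*p)) for p in pairs}
    print({p: s for p, s in scored.items() if s >= 0.5})   # simple threshold pruning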
Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System [Download Paper] Devin Petersohn (UC Berkeley)*, Dixin Tang (University of California, Berkeley), Rehan S Durrani (UC Berkeley), Areg Melik-adamyan (Intel Corporation), Joseph E Gonzalez (UC Berkeley), Anthony Joseph (UC Berkeley), Aditya Parameswaran (University of California, Berkeley) Dataframes have become universally popular as a means to flexibly represent data in various stages of structure and manipulate it using a rich set of operators, thereby becoming an essential tool in the data scientists' toolbox. However, dataframe systems, such as pandas, scale poorly and are non-interactive on moderate to large datasets. We discuss our experiences developing Modin, our first cut at a parallel dataframe system, which already has users across several industries and considerable traction within the open source GitHub community with over 1M downloads. Modin translates pandas functions into a core set of operators that are individually parallelized via a set of columnar, row-wise, and cell-wise decomposition rules that we formalize in this paper. We also introduce the notion of metadata independence to allow metadata, such as order and type information, to be decoupled from the physical representation and maintained in a lazy fashion, computed when needed. Using rule-based decomposition and metadata independence, along with careful engineering, Modin is able to support pandas operations across both rows and columns on very large dataframes, unlike Koalas and Dask DataFrames, which either break down or are unable to support such operations, while also being much faster than pandas.
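The row-wise decomposition rule can be pictured as: split the dataframe into row partitions, apply a row-wise pandas operation to each partition independently, and concatenate the partial results. A toy illustration of that rule (not Modin's actual execution engine; the partition count and operation are arbitrary):

    # Toy row-wise decomposition: an operation that works row-by-row (here dropna)
    # is applied to row partitions independently and the results are recomposed.
    import numpy as np
    import pandas as pd
    from concurrent.futures import ThreadPoolExecutor

    df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': [10, 20, 30, 40]})

    def rowwise_apply(frame, op, n_partitions=2):
        size = -(-len(frame) // n_partitions)                      # ceiling division
        partitions = [frame.iloc[i:i + size] for i in range(0, len(frame), size)]
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(op, partitions))                 # apply independently
        return pd.concat(parts)                                    # recompose

    print(rowwise_apply(df, lambda part: part.dropna()))
    # Same result as df.dropna(), but each row partition was processed on its own.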
A Critical Re-evaluation of Neural Methods for Entity Alignment [Download Paper] Manuel Leone (EPFL), Stefano Huber (EPFL), Akhil Arora (EPFL)*, Alberto Garcia-duran (EPFL), Robert West (EPFL) Neural methods have become the de-facto choice for the vast majority of data analysis tasks, and entity alignment (EA) is no exception. Not surprisingly, more than 50 different neural EA methods have been published since 2017. However, surprisingly, an analysis of the differences between neural and non-neural EA methods has been lacking. We bridge this gap by performing an in-depth comparison among five carefully chosen representative state-of-the-art methods from the pre-neural and neural era. We unravel, and consequently mitigate, the inherent deficiencies in the experimental setup utilized for evaluating neural EA methods. To ensure fairness in evaluation, we homogenize the entity matching modules of neural and non-neural methods. Additionally, for the first time, we draw a parallel between EA and record linkage (RL) by empirically showcasing the ability of RL methods to perform EA. Our results indicate that Paris, the state-of-the-art non-neural method, statistically significantly outperforms all the representative state-of-the-art neural methods in terms of both efficacy and efficiency across a wide variety of dataset types and scenarios, and is second only to BERT-INT for a specific scenario of cross-lingual EA. Our findings shed light on the potential problems resulting from an impulsive application of neural methods as a panacea for all data analytics tasks. Overall, our work results in two overarching conclusions: (1) Paris should be used as a baseline in every follow-up work on EA, and (2) neural methods need to be positioned better to showcase their true potential, for which we provide multiple recommendations.
MATE: Multi-Attribute Table Extraction [Download Paper] Mahdi Esmailoghli (Leibniz Universität Hannover)*, Jorge Arnulfo Quiane Ruiz (TU Berlin), Ziawasch Abedjan (Leibniz Universität Hannover) A core operation in data discovery is to find joinable tables for a given table. Real-world tables include both unary and n-ary join keys. However, existing table discovery systems are optimized for unary joins and are ineffective and slow in the presence of n-ary keys. In this paper, we introduce MATE, a table discovery system that leverages a novel hash-based index that enables n-ary join discovery through a space-efficient super key. We design a filtering layer that uses a novel hash function, XASH. This hash function encodes the syntactic features of all column values and aggregates them into a super key, which allows the system to efficiently prune tables with non-joinable rows. Our join discovery system is able to prune up to 1000x more false positives and achieves over 60x faster table discovery than the state of the art.
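The super-key idea can be pictured as a per-row signature that over-approximates the values in the join-key columns; the bit-OR signature below is only a simplified stand-in for XASH, whose exact construction is described in the paper, and the table and column names are made up:

    # Simplified row-signature filter in the spirit of a 'super key' (NOT the real XASH):
    # OR together hashes of a row's key-column values; a query key can only match rows
    # whose signature covers the key's hash bits, so other rows are pruned cheaply.
    import hashlib

    def value_hash(value, bits=64):
        h = int.from_bytes(hashlib.blake2b(str(value).encode(), digest_size=8).digest(), 'big')
        return 1 << (h % bits)            # set a single bit per value (toy choice)

    def row_signature(row, key_columns):
        sig = 0
        for c in key_columns:
            sig |= value_hash(row[c])
        return sig

    table = [{'first': 'ada', 'last': 'lovelace', 'year': 1843},
             {'first': 'alan', 'last': 'turing', 'year': 1950}]
    signatures = [row_signature(r, ['first', 'last']) for r in table]

    query_key = {'first': 'ada', 'last': 'lovelace'}
    qsig = row_signature(query_key, ['first', 'last'])
    candidates = [r for r, s in zip(table, signatures) if s & qsig == qsig]
    print(candidates)   # non-matching rows are pruned; surviving rows still need verification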
BAGUA: Scaling up Distributed Learning with System Relaxations [Download Paper] [Scalable Data Science] Shaoduo Gan (ETH Zurich)*, Xiangru Lian (University of Rochester), Rui Wang (Kuaishou Technology), Jianbin Chang (Kuaishou Technology), Chengjun Liu (Kuaishou Technology), Hongmei Shi (Kuaishou Technology), Shengzhuo Zhang (Kuaishou Technology), Xianghong Li (Kuaishou Technology), Tengxu Sun (Kuaishou Technology), Jiawei Jiang (Wuhan University), Binhang Yuan (ETH Zurich), Sen Yang (Kwai Inc.), Ji Liu (Kwai Inc.), Ce Zhang (ETH) Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via 'system relaxations': quantization, decentralization, and communication delay. However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all possible optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build Bagua, an MPI-style communication library providing a collection of primitives that is both flexible and modular enough to support state-of-the-art system relaxation techniques for distributed training. Powered by this design, Bagua can readily implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), Bagua can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time by a significant margin (up to 2 times) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.
Scalable Robust Graph Embedding with Spark [Download Paper] Chi Thang Duong (Ecole Polytechnique Federale de Lausanne)*, Dung Trung Hoang (Hanoi University of Science and Technology), Hongzhi Yin (The University of Queensland), Matthias Weidlich (Humboldt-Universität zu Berlin), Quoc Viet Hung Nguyen (Griffith University), Karl Aberer (EPFL) Graph embedding aims at learning a vector-based representation of vertices that incorporates the structure of the graph. This representation then enables inference of graph properties. Existing graph embedding techniques, however, do not scale well to large graphs. While several techniques to scale graph embedding using compute clusters have been proposed, they require continuous communication between the compute nodes and cannot handle node failure. We therefore propose a framework for scalable and robust graph embedding based on the MapReduce model, which can distribute any existing embedding technique. Our method splits a graph into subgraphs to learn their embeddings in isolation and subsequently reconciles the embedding spaces derived for the subgraphs. We realize this idea through a novel distributed graph decomposition algorithm. In addition, we show how to implement our framework in Spark to enable efficient learning of effective embeddings. Experimental results illustrate that our approach scales well, while largely maintaining the embedding quality.
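One common way to reconcile independently learned embedding spaces is to align them with an orthogonal transform fitted over anchor vertices shared by two subgraphs; a small sketch of that general technique (the paper's own reconciliation procedure may differ, and the sizes below are arbitrary):

    # Align a subgraph's embedding space to a reference space via orthogonal
    # Procrustes over shared anchor vertices (generic technique, not necessarily
    # the exact reconciliation step used in the paper).
    import numpy as np

    def procrustes_align(anchor_src, anchor_ref):
        # Orthogonal R minimizing ||anchor_src @ R - anchor_ref||_F.
        m = anchor_src.T @ anchor_ref
        u, _, vt = np.linalg.svd(m)
        return u @ vt

    rng = np.random.default_rng(0)
    ref = rng.normal(size=(5, 8))                      # anchor embeddings in the reference space
    rotation = np.linalg.qr(rng.normal(size=(8, 8)))[0]
    src = ref @ rotation.T                             # same anchors, rotated subgraph space

    R = procrustes_align(src, ref)
    print(np.allclose(src @ R, ref))                   # True: the two spaces are reconciled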
Deep Indexed Active Learning for Matching Heterogeneous Entity Representations [Download Paper] Arjit Jain (Indian Institute of Technology Bombay)*, Sunita Sarawagi (IIT Bombay), Prithviraj Sen (IBM Almaden Research Center) Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time.
QueryFormer: A Tree Transformer Model for Query Plan Representation [Download Paper] Yue Zhao (Nanyang Technological University)*, Gao Cong (Nanyang Technological University), Jiachen Shi (Nanyang Technological University), Chunyan Miao (NTU) Machine learning has become a prominent method in many database optimization problems such as cost estimation, index selection and query optimization. Translating query execution plans into their vectorized representations is non-trivial. Recently, several query plan representation methods have been proposed. However, they have two limitations. First, they do not fully utilize readily available database statistics in the representation, which characterize the data distribution. Second, they typically have difficulty in modeling long paths of information flow in a query plan, and in capturing parent-child dependencies between operators. To tackle these limitations, we propose QueryFormer, a learning-based query plan representation model with a tree-structured Transformer architecture. In particular, we propose a novel scheme to integrate histograms obtained from database systems into query plan encoding. In addition, to effectively capture the information flow following the tree structure of a query plan, we develop a tree-structured model with the attention mechanism. We integrate QueryFormer into four machine learning models, each for a database optimization task, and experimental results show that QueryFormer is able to improve the performance of these models significantly.
Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy [Download Paper] Ergute Bao (national university of singapore)*, Yizheng Zhu (National University of Singapore), Xiaokui Xiao (National University of Singapore), Yin Yang (Hamad bin Khalifa University), Beng Chin Ooi (NUS), Benjamin Tan (Institute for Infocomm Research), Khin Mi Mi Aung (ASTAR) Deep neural networks have strong capabilities of memorizing the underlying training data, which can be a serious privacy concern. An effective solution to this problem is to train models with differential privacy (DP), which provides rigorous privacy guarantees by injecting random noise to the gradients. This paper focuses on the scenario where sensitive data are distributed among multiple participants, who jointly train a model through federated learning, using both secure multiparty computation (MPC) to ensure the confidentiality of each gradient update, and differential privacy to avoid data leakage in the resulting model. A major challenge in this setting is that common mechanisms for enforcing DP in deep learning, which inject real-valued noise, are fundamentally incompatible with MPC, which exchanges finite-field integers among the participants. Consequently, most existing DP mechanisms require rather high noise levels, leading to poor model utility. Motivated by this, we propose Skellam mixture mechanism (SMM), a novel approach to enforcing DP on models built via federated learning. Compared to existing methods, SMM eliminates the assumption that the input gradients must be integer-valued, and, thus, reduces the amount of noise injected to preserve DP. Further, SMM allows tight privacy accounting due to the nice composition and sub-sampling properties of the Skellam distribution, which are key to accurate deep learning with DP. The theoretical analysis of SMM is highly non-trivial, especially considering (i) the complicated math of differentially private deep learning in general and (ii) the fact that the mixture of two Skellam distributions is rather complex, and to our knowledge, has not been studied in the DP literature. Extensive experiments on various practical settings demonstrate that SMM consistently and significantly outperforms existing solutions in terms of the utility of the resulting model.
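For intuition, Skellam noise is simply the difference of two independent Poisson draws, which keeps everything integer-valued and hence MPC-friendly. A toy sketch of adding such noise to a quantized gradient (the quantization scale and noise variance are illustrative; calibrating them to a target privacy budget is exactly what the paper's analysis provides):

    # Toy Skellam noise on an integer-quantized gradient (illustration only).
    import numpy as np

    rng = np.random.default_rng(42)

    def add_skellam_noise(int_gradient, mu):
        # Skellam(mu, mu) = Poisson(mu) - Poisson(mu): zero mean, variance 2*mu, integer-valued.
        noise = rng.poisson(mu, size=int_gradient.shape) - rng.poisson(mu, size=int_gradient.shape)
        return int_gradient + noise

    gradient = np.array([1.7, -0.3, 0.9])
    scale = 100                                    # hypothetical quantization scale
    quantized = np.round(gradient * scale).astype(np.int64)
    noisy = add_skellam_noise(quantized, mu=50.0)
    print(quantized, noisy, noisy / scale)         # integers throughout, as MPC requires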
Retrofitting GDPR Compliance onto Legacy Databases [Download Paper] Archita Agarwal (Brown University)*, Marilyn George (Brown University), Aaron R Jeyaraj (Brown University), Malte Schwarzkopf (Brown University) New privacy laws like the European Union's General Data Protection Regulation (GDPR) require database administrators (DBAs) to identify all information related to an individual on request, e.g., to return or delete it. This requires time-consuming manual labor today, particularly for legacy schemas and applications. In this paper, we investigate what it takes to provide mostly-automated tools that assist DBAs in GDPR-compliant data extraction for legacy databases. We find that a combination of techniques is needed to realize a tool that works for the databases of real-world applications, such as web applications, which may violate strict normal forms or encode data relationships in bespoke ways. Our tool, GDPRizer, relies on foreign keys, query logs that identify implied relationships, data-driven methods, and coarse-grained annotations provided by the DBA to extract an individual's data. In a case study with three popular web applications, GDPRizer achieves 100% precision and 96-100% recall. GDPRizer saves work compared to hand-written queries, and while manual verification of its outputs is required, GDPRizer simplifies privacy compliance.
Scalar DL: Scalable and Practical Byzantine Fault Detection for Transactional Database Systems [Download Paper] Hiroyuki Yamada (Scalar, Inc.)*, Jun Nemoto (Scalar, Inc.) This paper presents Scalar DL, a Byzantine fault detection (BFD) middleware for transactional database systems. Scalar DL manages two separately administered database replicas in a database system and can detect Byzantine faults in the database system as long as either replica is honest (not faulty). Unlike previous BFD works, Scalar DL executes non-conflicting transactions in parallel while preserving a correctness guarantee. Moreover, Scalar DL is database-agnostic middleware so that it achieves the detection capability in a database system without either modifying the databases or using database-specific mechanisms. Experimental results with YCSB and TPC-C show that Scalar DL outperforms a state-of-the-art BFD system by 3.5 to 10.6 times in throughput and works effectively on multiple database implementations. We also show that Scalar DL achieves near-linear (91%) scalability when the number of nodes composing each replica increases.
AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data [Download Paper] Ryan Mckenna (University of Massachusetts, Amherst)*, Brett Mullins (University of Massachusetts), Daniel Sheldon (University of Massachusetts, Amherst), Gerome Miklau (University of Massachusetts Amherst) We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm, within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.
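A skeletal view of the select-measure-generate paradigm that AIM instantiates, shown on 1-way marginals with Gaussian noise; the selection score, temperature, noise scale, and the trivial "generation" step are placeholders, not AIM's actual mechanism or its privacy accounting:

    # Skeleton of select-measure-generate for DP synthetic data (NOT AIM itself).
    import numpy as np

    rng = np.random.default_rng(7)
    data = rng.integers(0, 4, size=(1000, 3))          # toy dataset: 3 attributes, domain {0,...,3}

    def marginal(col):
        return np.bincount(data[:, col], minlength=4).astype(float)

    model = {c: np.full(4, len(data) / 4.0) for c in range(3)}   # uniform starting estimates
    sigma = 20.0                                                 # illustrative noise scale

    for _ in range(3):
        # Select (exponential-mechanism-style): prefer the attribute whose marginal the
        # model currently fits worst; the 0.01 temperature is an arbitrary placeholder.
        scores = np.array([np.abs(model[c] - marginal(c)).sum() for c in range(3)])
        probs = np.exp(0.01 * (scores - scores.max()))
        chosen = rng.choice(3, p=probs / probs.sum())
        # Measure: noisy marginal of the chosen attribute.
        noisy = marginal(chosen) + rng.normal(0.0, sigma, size=4)
        # Generate/update: adopt the clipped noisy measurement as the new estimate.
        model[chosen] = np.clip(noisy, 0.0, None)

    synthetic = np.column_stack([
        rng.choice(4, size=len(data), p=model[c] / model[c].sum()) for c in range(3)
    ])
    print(synthetic[:5])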
Hybrid Blockchain Database Systems: Design and Performance [Download Paper] Zerui Ge (National University of Singapore), Dumitrel Loghin (National University of Singapore)*, Beng Chin Ooi (NUS), Pingcheng Ruan (National University of Singapore), Tianwen Wang (National University of Singapore) With the emergence of hybrid blockchain database systems, we aim to provide an in-depth analysis of the performance and trade-offs among a few representative systems. To achieve this goal, we implement Veritas and BlockchainDB from scratch. For Veritas, we provide two flavors to target the crash fault-tolerant (CFT) and Byzantine fault-tolerant (BFT) application scenarios. Specifically, we implement Veritas with Apache Kafka to target CFT application scenarios, and Veritas with Tendermint to target BFT application scenarios. We compare these three systems with the existing open-source implementation of BigchainDB. BigchainDB uses Tendermint for consensus and provides two flavors: a default implementation with blockchain pipelining and an optimized version that includes blockchain pipelining and parallel transaction validation. Our experimental analysis confirms that CFT designs, which are typically used by distributed databases, exhibit much higher performance than BFT designs, which are specific to blockchains. On the other hand, our extensive analysis highlights the variety of design choices faced by the developers and sheds some light on the trade-offs that need to be done when designing a hybrid blockchain database system.
Hu-Fu: Efficient and Secure Spatial Queries over Data Federation [Download Paper] Yongxin Tong (Beihang University)*, Xuchen Pan (Beihang University), Yuxiang Zeng (Hong Kong University of Science and Technology), Yexuan Shi (Beihang University), Chunbo Xue (Beihang University), Zimu Zhou (Singapore Management University), Xiaofei Zhang (University of Memphis), Lei Chen (Hong Kong University of Science and Technology), Yi Xu (Beihang University), Ke Xu (Beihang University), Weifeng Lv (Beihang University) Data isolation has become an obstacle to scaling up query processing over big data, since sharing raw data among data owners is often prohibitive due to security concerns. A promising solution is to perform secure queries over a federation of multiple data owners leveraging secure multi-party computation (SMC) techniques, as evidenced by recent federation work over relational data. However, existing solutions are highly inefficient on spatial queries due to excessive secure distance operations for query processing and their usage of general-purpose SMC libraries for secure operation implementation. In this paper, we propose Hu-Fu, the first system for efficient and secure spatial query processing on a data federation. The idea is to decompose the secure processing of a spatial query into as many plaintext operations and as few secure operations as possible, where fewer secure operators are involved and all secure operators have dedicated implementations. As a working system, Hu-Fu supports not only query input in native SQL, but also heterogeneous spatial databases (e.g., PostGIS, Simba, GeoMesa, and SpatialHadoop) at the backend. Extensive experiments show that Hu-Fu usually outperforms the state of the art in running time and communication cost while guaranteeing security.
NBTree: a Lock-free PM-friendly Persistent B+-Tree for eADR-enabled PM Systems [Download Paper] Bowen Zhang (Shanghai Jiao Tong University)*, Shengan Zheng (Shanghai Jiao Tong University), Zhenlin Qi (Shanghai Jiao Tong University), Linpeng Huang (Shanghai Jiao Tong University) Persistent memory (PM) promises near-DRAM performance as well as data persistency. Recently, a new feature called eADR has become available on the 2nd generation Intel Optane PM with the 3rd generation Intel Xeon Scalable Processors. eADR ensures that data stored within the CPU caches will be flushed to PM upon power failure. Thus, in eADR-enabled PM systems, globally visible data is considered persistent, and explicit data flushes are no longer necessary. The emergence of eADR presents unique opportunities to build lock-free data structures and unleash the full potential of PM. In this paper, we propose NBTree, a lock-free PM-friendly B+-Tree, to deliver high scalability and low PM overhead. To our knowledge, NBTree is the first persistent index designed for eADR-enabled PM systems. To achieve lock-freedom, NBTree uses atomic primitives to serialize leaf node operations. Moreover, NBTree proposes four novel techniques to enable lock-free access to the leaf during structural modification operations (SMO), including three-phase SMO, sync-on-write, sync-on-read, and cooperative SMO. For inner node operations, we develop a shift-aware search algorithm to resolve read-write conflicts. To reduce PM overhead, NBTree decouples the leaf nodes into a metadata layer and a key-value layer. The metadata layer is stored in DRAM, along with the inner nodes, to reduce PM accesses. NBTree also adopts log-structured insert and in-place update/delete to improve cache utilization. Our evaluation shows that NBTree achieves up to 11x higher throughput and 43x lower 99% tail latency than state-of-the-art persistent B+-Trees under YCSB workloads.
Near-Data Processing in Database Systems on Native Computational Storage under HTAP Workloads [Download Paper] Tobias Vincon (Reutlingen University), Christian Knoedler (Reutlingen University), Leonardo Solis-vasquez (Technical University of Darmstadt), Arthur Bernhardt (Reutlingen University), Sajjad Tamimi (TU Darmstadt), Lukas Weber (TU Darmstadt), Florian Stock (TU Darmstadt), Andreas Koch (TU Darmstadt), Ilia Petrov (Reutlingen University)* Today's Hybrid Transactional and Analytical Processing (HTAP) systems tackle ever-growing data in combination with a mixture of transactional and analytical workloads. While optimizing for aspects such as data freshness and performance isolation, they build on the traditional data-to-code principle and may trigger massive cold data transfers that impair the overall performance and scalability. Firstly, in this paper we show that Near-Data Processing (NDP) naturally fits in the HTAP design space. Secondly, we propose an NDP database architecture, allowing transactionally consistent in-situ executions of analytical operations in HTAP settings. We evaluate the proposed architecture in state-of-the-art key/value-stores and multi-versioned DBMS. In contrast to traditional setups, our approach yields robust, resource- and cost-efficient performance.
Facilitating Database Tuning with Hyper-Parameter Optimization: A Comprehensive Experimental Evaluation [Download Paper] Xinyi Zhang (Peking University), Zhuo Chang (Peking University), Yang Li (Peking University), Hong Wu (Alibaba), Jian Tan (Alibaba), Feifei Li (Alibaba Group), Bin Cui (Peking University)* Recently, using automatic configuration tuning to improve the performance of modern database management systems (DBMSs) has attracted increasing interest from the database community. This is embodied in the number of systems with advanced tuning capabilities that have recently been developed. However, it remains a challenge to select the best solution for database configuration tuning, considering the large body of algorithm choices. In addition, beyond applications to database systems, more potential algorithms designed for configuration tuning can be found. To this end, this paper provides a comprehensive evaluation of configuration tuning techniques from a broader perspective, hoping to better benefit the database community. In particular, we summarize three key modules of database configuration tuning systems and conduct extensive ablation studies using various challenging cases. Our evaluation demonstrates that hyper-parameter optimization algorithms can be borrowed to further enhance database configuration tuning. Moreover, we identify the best algorithm choices for different modules. Beyond the comprehensive evaluations, we offer an efficient and unified database configuration tuning benchmark via surrogates that reduces the evaluation cost to a minimum, allowing for extensive runs and analysis of new techniques.
TencentCLS: The Cloud Log Service with High Query Performances [Download Paper] [Industry] Muzhi Yu (Peking University)*, Zhaoxiang Lin (tencent), Jinan Sun (Peking University), ZHOU RUNYUN (Tencent Cloud Computing (Beijing) Co., Ltd.), Jiang Guoqiang (tencent), hua huang (tencent), Shikun Zhang (Peking University) With the trend of cloud computing, the cloud log service is becoming increasingly important, as it plays a critical role in tasks such as root cause analysis, service monitoring and security auditing. To meet these needs, we provide Tencent Cloud Log Service (TencentCLS), a one-stop solution for log collection, storage, analysis and dumping. It currently hosts more than a million tenants, of which the largest can generate PB-level logs per day. The most important challenge that TencentCLS faces is to support both low-latency and resource-efficient queries on such large quantities of log data. To address that challenge, we propose a novel search engine based upon Lucene. The system features a novel procedure for querying logs within a time range, an indexing technique for the time field, as well as optimized query algorithms dedicated to multiple critical and common query types. As a result, the search engine at TencentCLS gains significant performance improvements over Lucene. It achieves a ~20x performance increase with standard queries and a ~10x performance increase with histogram queries in massive log query scenarios. In addition, TencentCLS also supports storing and querying with microsecond-level time precision, as well as the capability to preserve microsecond-level time ordering.
Prefix Filter: Practically and Theoretically Better Than Bloom [Download Paper] Tomer Even (Tel Aviv University)*, Guy Even (Tel Aviv University), Adam Morrison (Tel Aviv University) Many applications of approximate membership query data structures, or filters, require only an incremental filter that supports insertions but not deletions. However, the design space of incremental filters is missing a "sweet spot" filter that combines space efficiency, fast queries, and fast insertions. Incremental filters, such as the Bloom and blocked Bloom filter, are not space efficient. Dynamic filters (i.e., supporting deletions), such as the cuckoo or vector quotient filter, are space efficient but do not exhibit consistently fast insertions and queries. In this paper, we propose the prefix filter, an incremental filter that addresses the above challenge: (1) its space (in bits) is similar to state-of-the-art dynamic filters; (2) query throughput is high and is comparable to that of the cuckoo filter; and (3) insert throughput is high with overall build times faster than those of the vector quotient filter and cuckoo filter by 1.25x-1.33x and 3x-3.2x, respectively. We present a rigorous analysis of the prefix filter that holds also for practical set sizes (i.e., n = 2^25). The analysis deals with the probability of failure, false positive rate, and probability that an operation requires accessing more than a single cache line.
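For readers less familiar with the incremental baseline being improved upon, a textbook Bloom filter supports exactly the insert/query interface discussed here; a minimal sketch with arbitrary parameters (this is the standard data structure, not the paper's prefix filter):

    # Textbook Bloom filter: incremental inserts, approximate membership queries.
    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 16, k_hashes=4):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item):
            digest = hashlib.sha256(str(item).encode()).digest()
            for i in range(self.k):
                chunk = int.from_bytes(digest[4 * i: 4 * i + 4], 'big')
                yield chunk % self.m

        def insert(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def query(self, item):        # no false negatives; false positives possible
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    bf = BloomFilter()
    for key in range(1000):
        bf.insert(key)
    print(bf.query(42), bf.query(123456))   # True; almost surely False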
DSB: A Decision Support Benchmark for Workload-Driven and Traditional Database Systems [Download Paper] Bailu Ding (Microsoft Research)*, Surajit Chaudhuri (Microsoft), Johannes Gehrke (Microsoft), Vivek Narasayya (Microsoft) We describe a new benchmark, DSB, for evaluating both workload-driven and traditional database systems on modern decision support workloads. DSB is adapted from the widely-used industrial-standard TPC-DS benchmark. It enhances the TPC-DS benchmark with complex data distribution and challenging yet semantically meaningful query templates. DSB also introduces configurable and dynamic workloads to assess the adaptability of database systems. Since workload-driven and traditional database systems have different performance dimensions, including the additional resources required for tuning and maintaining the systems, we provide guidelines on evaluation methodology and metrics to report. We show a case study on how to evaluate both workload-driven and traditional database systems with the DSB benchmark. The code for the DSB benchmark is open sourced and is available at https://aka.ms/dsb.
A Study of Database Performance Sensitivity to Experiment Settings [Download Paper] Yang Wang (The Ohio State University)*, Miao Yu (The Ohio State University), Yujie Hui (The Ohio State University), Fang Zhou (The Ohio State University), Yuyang Huang (Ohio State University), Rui Zhu (The Ohio State University), Xueyuan Ren (The Ohio State University), Tianxi Li (The Ohio State University), Xiaoyi Lu (UC Merced) To allow performance comparison across different systems, our community has developed multiple benchmarks, such as TPC-C and YCSB, which are widely used. However, despite such effort, interpreting and comparing performance numbers is still a challenging task, because one can tune benchmark parameters, system features, and hardware settings, which can lead to very different system behaviors. Such tuning creates a long-standing question of whether the conclusion of a work can hold under a wider range of settings. This work tries to shed light on this question by reproducing 11 works evaluated under TPC-C and YCSB, measuring their performance under a wider range of settings, and investigating the reasons for the change of performance numbers. By doing so, this paper tries to motivate the discussion about whether and how we should address this problem. While this paper does not give a complete solution (that is beyond the scope of a single paper), it proposes concrete suggestions we can take to improve the state of the art.
What Is the Price for Joining Securely? Benchmarking Equi-Joins in Trusted Execution Environments [Download Paper] [Experiment, Analysis & Benchmark Papers] Kajetan Maliszewski (TU Berlin)*, Jorge Arnulfo Quiane Ruiz (TU Berlin), Jonas Traub (TU Berlin), Volker Markl (Technische Universität Berlin) Protection of personal data has become one of the top requirements of modern systems. At the same time, it is now common that the owner of the data and the owner of the computing infrastructure are two entities with limited trust between them (e.g., volunteer computing or the hybrid-cloud). Recently, trusted execution environments (TEEs) have become a viable solution to ensure the security of systems in such environments. However, the performance of relational operators in TEEs remains an open problem. We conduct a comprehensive experimental study to identify the main bottlenecks and challenges when executing relational equi-joins in TEEs. For this, we introduce TEEbench, a framework for unified benchmarking of relational operators in TEEs, and use it for conducting our experimental evaluation. In a nutshell, we perform the following experimental analysis for eight core join algorithms: off-the-shelf performance; the performance implications of data sealing and obliviousness; sensitivity and scalability. The results show that all eight join algorithms significantly suffer from different performance bottlenecks in TEEs. They can be up to three orders of magnitude slower in TEEs than on plain CPUs. Our study also indicates that existing join algorithms need a complete, hardware-aware redesign to be efficient in TEEs, and that, in secure query plans, managing TEE features is as important as join selection.
DISTILL: Low-Overhead Data-Driven Techniques for Filtering and Costing Indexes for Scalable Index Tuning [Download Paper] Tarique Siddiqui (Microsoft Research)*, Wentao Wu (Microsoft Research), Vivek Narasayya (Microsoft), Surajit Chaudhuri (Microsoft) Many database systems offer index tuning tools that help automatically select appropriate indexes for improving the performance of an input workload. Index tuning is a resource-intensive and time-consuming task requiring expensive optimizer calls for estimating the cost of queries over potential index configurations. In this work, we develop low-overhead techniques that can be leveraged by index tuning tools for reducing a large number of optimizer calls without making changes to the tuning algorithm or to the query optimizer. First, index tuning tools use rule-based techniques to generate a large number of syntactically-relevant indexes; however, a large proportion of such indexes are spurious and do not lead to a significant improvement in the performance of queries. We eliminate such indexes much earlier in the search by leveraging patterns in the workload, without making optimizer calls. Second, we learn cost models that exploit the similarity between query and index configuration pairs in the workload to efficiently estimate the cost of queries over a large number of index configurations using fewer optimizer calls. We perform an extensive evaluation over both real-world and synthetic benchmarks, and show that given the same set of input queries, indexes, and the search algorithm for exploration, our proposed techniques can lead to a median reduction in tuning time of 3x and a maximum of 12x compared to state-of-the-art tuning tools with similar quality of recommended indexes.
Time Series Data Encoding for Efficient Storage: A Comparative Analysis in Apache IoTDB [Download Paper] [Experiment, Analysis & Benchmark Papers] Jinzhao Xiao (Tsinghua University), Yuxiang Huang (Tsinghua University), Changyu Hu (Tsinghua University), Shaoxu Song (Tsinghua University)*, Xiangdong Huang (Tsinghua University), Jianmin Wang (Tsinghua University, China) Not only the vast applications but also the distinct features of time series data stimulate the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, OpenTSDB and so on. Almost all these systems employ columnar storage with effective encoding of time series data. Given the distinct features of various time series data, it is not surprising that different encoding strategies may perform differently. In this study, we first summarize the features of time series data that may affect encoding performance, including scale, delta, repeat and increase. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, which prescribes the constraints on implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness with respect to various data features is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator covering the aforesaid data features and several real-world datasets from our industrial partners. Finally, we present an extensive experimental evaluation using the benchmark. Remarkably, a quantitative analysis of encoding effectiveness with respect to various data features is conducted in Apache IoTDB.
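As a concrete instance of the "delta" feature mentioned above, timestamp columns are typically delta-encoded before compression; a minimal sketch of the general technique (production encoders layer further tricks such as zigzag coding and bit-packing on top):

    # Minimal delta encoding/decoding for a timestamp column.
    def delta_encode(values):
        out, prev = [], 0
        for v in values:
            out.append(v - prev)
            prev = v
        return out

    def delta_decode(deltas):
        out, acc = [], 0
        for d in deltas:
            acc += d
            out.append(acc)
        return out

    timestamps = [1660000000, 1660000010, 1660000020, 1660000021, 1660000035]
    deltas = delta_encode(timestamps)
    print(deltas)                                   # small, repetitive values compress well
    assert delta_decode(deltas) == timestamps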
Efficient Shortest Path Counting on Large Road Networks [Download Paper] Yu-xuan Qiu (University of Technology Sydney)*, Dong Wen (University of New South Wales), Lu Qin (UTS), Wentao Li (University of Technology Sydney), Ronghua Li (Beijing Institute of Technology), Ying Zhang (University of Technology Sydney) The shortest path distance and related concepts lay the foundations of many real-world applications in road network analysis. The shortest path count has drawn much research attention in academia, not only as a closeness metric accompanying the shortest distance but also as a building block of centrality computation. This paper aims to improve the efficiency of counting the shortest paths between two query vertices on a large road network. We propose a novel index solution that organizes all vertices in a tree structure, and propose several optimizations to speed up index construction. We conduct extensive experiments on 14 real-world networks. Compared with the state-of-the-art solution, we achieve much higher efficiency on both query processing and index construction with a more compact index.
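The quantity being indexed can also be computed from scratch by a counting variant of Dijkstra's algorithm, which is the natural (but slow) online baseline that index-based methods aim to beat; a sketch with strictly positive edge weights and a made-up toy graph:

    # Counting variant of Dijkstra: number of shortest s-t paths in a weighted graph
    # (weights assumed strictly positive, so counts are finalized in order of distance).
    import heapq

    def shortest_path_count(adj, s, t):
        dist, count = {s: 0}, {s: 1}
        heap = [(0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float('inf')):
                continue                              # stale heap entry
            for v, w in adj.get(u, []):
                nd = d + w
                if nd < dist.get(v, float('inf')):
                    dist[v], count[v] = nd, count[u]
                    heapq.heappush(heap, (nd, v))
                elif nd == dist[v]:
                    count[v] += count[u]
        return dist.get(t, float('inf')), count.get(t, 0)

    adj = {'a': [('b', 1), ('c', 1)], 'b': [('d', 1)], 'c': [('d', 1)], 'd': []}
    print(shortest_path_count(adj, 'a', 'd'))         # (2, 2): two shortest paths of length 2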
Identifying Similar-Bicliques in Bipartite Graphs [Download Paper] Kai Yao (The University of Sydney)*, Lijun Chang (The University of Sydney), Jeffrey Xu Yu (Chinese University of Hong Kong) Bipartite graphs have been widely used to model the relationship between entities of different types, where vertices are partitioned into two disjoint sets/sides. Finding dense subgraphs in a bipartite graph is of great significance and encompasses many applications. However, none of the existing dense bipartite subgraph models consider similarity between vertices from the same side, and as a result, the identified results may include vertices that are not similar to each other. In this paper, we formulate the notion of similar-biclique which is a special kind of biclique where all vertices from a designated side are similar to each other, and aim to enumerate all similar-bicliques. The naive approach of first enumerating all maximal bicliques and then extracting all maximal similar-bicliques from them is inefficient, as enumerating maximal bicliques is time consuming. We propose a backtracking algorithm MSBE to directly enumerate maximal similar-bicliques, and power it by vertex reduction and optimization techniques. Furthermore, we design a novel index structure to speed up a time-critical operation of MSBE, as well as to speed up vertex reduction. Efficient index construction algorithms are also developed. Extensive experiments on 17 bipartite graphs as well as case studies are conducted to demonstrate the effectiveness and efficiency of our model and algorithms.
Maximizing Fair Content Spread via Edge Suggestion in Social Networks [Download Paper] Ian Swift (University of Illinois at Chicago)*, Sana Ebrahimi (University of Illinois at Chicago), Azade Nova (Google Brain), Abolfazl Asudeh (University of Illinois at Chicago) Content spread inequity is a potential unfairness issue in online social networks, disparately impacting minority groups. In this paper, we view friendship suggestion, a common feature in social network platforms, as an opportunity to achieve an equitable spread of content. In particular, we propose to suggest a subset of potential edges (currently not existing in the network but likely to be accepted) that maximizes content spread while achieving fairness. Instead of re-engineering the existing systems, our proposal builds a fairness wrapper on top of the existing friendship suggestion components. We prove the problem is NP-hard and inapproximable in polynomial time unless P = NP. Therefore, allowing relaxation of the fairness constraint, we propose an algorithm based on LP-relaxation and randomized rounding with fixed approximation ratios on fairness and content spread. We provide multiple optimizations, further improving the performance of our algorithm in practice. In addition, we propose a scalable algorithm that dynamically adds subsets of nodes, chosen via iterative sampling, and solves smaller problems corresponding to these nodes. Beyond the theoretical analysis, we conduct comprehensive experiments on real and synthetic data sets. Across different settings, our algorithms find solutions with near-zero unfairness while significantly increasing the content spread. Our scalable algorithm can process a graph with half a million nodes on a single machine, reducing the unfairness to around 0.0004 while lifting content spread by 43%.
Migrating Social Event Recommendation Over Microblogs [Download Paper] Xiangmin Zhou (RMIT University)*, Lei Chen (Hong Kong University of Science and Technology) Real applications like crisis management require real-time awareness of critical situations. However, services using traditional methods like phone calls can easily be delayed due to busy lines, transfer delays or limited communication ability in disaster areas. Existing social event analysis solutions have enhanced the situation awareness of systems. Unfortunately, they cannot recognize the complex migrating social events that are first observed in social media at a specific time, place and state, but have further moved in space and time, which may affect the comprehension of the system. While the discussion on events appears in microblogs, their movement over different contexts is unavoidable. So far, the problem of migrating social event analysis over big media has not been well investigated. To address this issue, we propose a novel framework to monitor and deliver the migrating events in big social media data, which fully exploits the information of social media over multiple attributes and their inherent interactions among events. Specifically, we first propose a Concept TF/IDF model to capture the content that is constrained by the time and location of social posts without a costly learning process. Then, we construct a novel Maximal User Influence Graph (MUIG) to extract the social interactions. With MUIG, the event migrations over space and time are well identified. Finally, we design efficient query strategies over Apache Spark for recommending events in real time. Extensive tests over big media are conducted to demonstrate the high effectiveness and efficiency of our approach.
Ginex: SSD-enabled Billion-scale Graph Neural Network Training on a Single Machine via Provably Optimal In-memory Caching [Download Paper] Yeonhong Park (Seoul National University)*, Sunhong Min (Seoul National University), Jae W. Lee (Seoul National University) Recently, Graph Neural Networks (GNNs) have been receiving a spotlight as a powerful tool that can effectively serve various inference tasks on graph structured data. As the size of real-world graphs continues to scale, the GNN training system faces a scalability challenge. Distributed training is a popular approach to address this challenge by scaling out CPU nodes. However, not much attention has been paid to disk-based GNN training, which can scale up the single-node system in a more cost-effective manner by leveraging high-performance storage devices like NVMe SSDs. We observe that the data movement between the main memory and the disk is the primary bottleneck in the SSD-based training system, and that the conventional GNN training pipeline is sub-optimal without taking this overhead into account. Thus, we propose Ginex, the first SSD-based GNN training system that can process billion-scale graph datasets on a single machine. Inspired by the inspector-executor execution model in compiler optimization, Ginex restructures the GNN training pipeline by separating sample and gather stages. This separation enables Ginex to realize a provably optimal replacement algorithm, known as Belady's algorithm, for caching feature vectors in memory, which account for the dominant portion of I/O accesses. According to our evaluation with four billion-scale graph datasets, Ginex achieves 2.11x higher training throughput on average (up to 2.67x at maximum) than the SSD-extended PyTorch Geometric.
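Belady's rule, which Ginex can realize because the inspector phase exposes the full access sequence in advance, is simple to state: on a miss with a full cache, evict the resident item whose next use lies farthest in the future (or never occurs). A compact, generic sketch of that policy (not Ginex's implementation; the trace and capacity are arbitrary):

    # Belady's optimal (offline) cache replacement, applicable only when the
    # access sequence is known ahead of time.
    def belady_hits(accesses, capacity):
        cache, hits = set(), 0
        for i, item in enumerate(accesses):
            if item in cache:
                hits += 1
                continue
            if len(cache) >= capacity:
                def next_use(x):
                    for j in range(i + 1, len(accesses)):
                        if accesses[j] == x:
                            return j
                    return float('inf')            # never used again: ideal victim
                cache.remove(max(cache, key=next_use))
            cache.add(item)
        return hits

    trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
    print(belady_hits(trace, capacity=3))          # 5 hits under the optimal policy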
Sortledton: a universal, transactional graph data structure [Download Paper] [Best Regular Research Paper Runner Ups] Per Fuchs (Technische Universität München)*, Jana Giceva (TU Munich), Domagoj Margan (Imperial College London) Despite the wide adoption of graph processing across many different application domains, there is no underlying data structure that can serve a variety of graph workloads (analytics, traversals, and pattern matching) on dynamic graphs with transactional updates. In this paper, we present Sortledton, a universal graph data structure that addresses the open problem by being carefully optimized for the most relevant data access patterns used by graph computation kernels. It can support millions of transactional updates per second, while providing competitive performance (1.22x on average) for the most common graph workloads relative to the best-known baseline for static graphs, CSR. With this, we improve the ingestion throughput over state-of-the-art dynamic graph data structures, while supporting a wider range of graph computations under transactional guarantees, with a much simpler design and significantly smaller memory footprint (2.1x that of CSR).
Succinct Graph Representations as Distance Oracles: An Experimental Evaluation [Download Paper] Arpit Merchant (University of Helsinki)*, Aristides Gionis (KTH Royal Institute of Technology), Michael Mathioudakis (University of Helsinki) Distance oracles answer shortest-path queries between any pair of nodes in a graph. They are often built using succinct graph representations such as spanners, sketches, and compressors to minimize oracle size and query answering latency. Node embeddings, in particular, offer graph representations that place similar nodes nearby in a low-rank space. However, their use in distance oracles has not been sufficiently studied and compared to other representations. In this paper, we compare experimentally different distance oracles that are based on a variety of node embeddings and other graph representations. The evaluation focuses on exact distance oracles and is made in terms of relevant measures of efficiency, i.e., construction time, memory requirements, and query-processing time. It is conducted over fourteen real-world graph datasets and four synthetic graph families. Our findings suggest that distances between node embeddings are excellent estimators of graph distances when graphs are well-structured, for instance, when they are regular, or have high clustering coefficient and density. Moreover, depending on the embedding algorithm, their construction is up to 19 times faster than multi-dimensional scaling, they require up to 2 times less memory than approximate distance-preserving data structures, up to 23 times less processing time than compressed indexes, and are up to 1.7 times more exact than spanners. Finally, while the exactness of distance oracles is infeasible to maintain for huge graphs even under large amounts of resources, we find experimentally that GOSH, a parallelized implementation of spectral embedding, scales to graphs of 100M nodes with little loss of accuracy.
Edge-based Local Push for Personalized PageRank [Download Paper] Hanzhi Wang (Renmin University of China)*, Zhewei Wei (Renmin University of China), Junhao Gan (University of Melbourne), Ye Yuan (Beijing Institute of Technology), Xiaoyong Du (Renmin University of China), Ji-rong Wen (Renmin University of China) Personalized PageRank (PPR) is a popular node proximity metric in graph mining and network research. A single-source PPR (SSPPR) query asks for the PPR value of each node on the graph. Due to its importance and wide applications, decades of efforts have been devoted to the efficient processing of SSPPR queries. Among existing algorithms, LocalPush is a fundamental method for SSPPR queries and serves as a cornerstone for subsequent algorithms. In LocalPush, a push operation is a crucial primitive operation, which distributes the probability at a node u to ALL of u's neighbors via the corresponding edges. Although this push operation works well on unweighted graphs, unfortunately, it can be rather inefficient on weighted graphs. In particular, on unbalanced weighted graphs where only a few of these edges take the majority of the total weight among them, the push operation would have to distribute "insignificant" probabilities along those edges which take only the minor weights, resulting in expensive overhead. To resolve this issue, in this paper, we propose the EdgePush algorithm, a novel method for computing SSPPR queries on weighted graphs. EdgePush decomposes the aforementioned push operations into edge-based pushes, allowing the algorithm to operate at edge-level granularity. As a result, it can flexibly distribute the probabilities according to edge weights. Furthermore, our EdgePush allows a fine-grained termination threshold for each individual edge, leading to a superior complexity over LocalPush. Notably, we prove that EdgePush improves the theoretical query cost of LocalPush by an order of up to O(n) when the graph's weights are unbalanced. Our experimental results demonstrate that EdgePush significantly outperforms state-of-the-art baselines in terms of query efficiency on large motif-based and real-world weighted graphs.
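For context, the classical vertex-level push primitive that EdgePush refines works as follows on a weighted graph: move an alpha fraction of a vertex's residual into its estimate and spread the rest over ALL outgoing edges in proportion to their weights. A compact sketch of that baseline (alpha, the termination threshold, and the toy graph are illustrative; EdgePush instead pushes along individual edges with per-edge thresholds):

    # Classical forward local push for single-source PPR on a weighted graph.
    # adj must list every vertex as a key: {u: [(v, positive_weight), ...]}.
    def local_push(adj, source, alpha=0.2, eps=1e-6):
        estimate = {u: 0.0 for u in adj}
        residual = {u: 0.0 for u in adj}
        residual[source] = 1.0
        out_weight = {u: sum(w for _, w in adj[u]) for u in adj}
        active = [u for u in adj if residual[u] > eps * out_weight[u]]
        while active:
            u = active.pop()
            r = residual[u]
            if r <= eps * out_weight[u]:
                continue
            estimate[u] += alpha * r
            residual[u] = 0.0
            for v, w in adj[u]:                     # push to ALL neighbors, even tiny-weight ones
                residual[v] += (1 - alpha) * r * (w / out_weight[u])
                if residual[v] > eps * out_weight[v]:
                    active.append(v)
        return estimate

    adj = {'a': [('b', 9.0), ('c', 1.0)], 'b': [('a', 1.0)], 'c': [('a', 1.0)]}
    print(local_push(adj, 'a'))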
A Near-Optimal Approach to Edge Connectivity-Based Hierarchical Graph Decomposition [Download Paper] Lijun Chang (The University of Sydney)*, Zhiyi Wang (The University of Sydney) Driven by applications in graph analytics, the problem of efficiently computing all k-edge connected components (k-ECCs) of a graph G for a user-given k has been extensively and well studied. It is known that the k-ECCs of G for all possible values of k form a hierarchical structure. In this paper, we study the problem of efficiently constructing the hierarchy tree for G which compactly encodes the k-ECCs for all possible k values in space linear to the number of vertices n. All existing approaches construct the hierarchy tree in O(δ(G) × T_KECC(G)) time, where δ(G) is the degeneracy of G and T_KECC(G) is the time complexity of computing all k-ECCs of G for a specific k value. To improve the time complexity, we propose a divide-and-conquer approach running in O((log δ(G)) × T_KECC(G)) time, which is optimal up to a logarithmic factor. However, a straightforward implementation of our algorithm would result in a space complexity of O((m + n) log δ(G)). As main memory also becomes a scarce resource when processing large-scale graphs, we further propose techniques to optimize the space complexity to 2m + O(n log δ(G)), where m is the number of edges in G. Extensive experiments on large real graphs and synthetic graphs demonstrate that our approach outperforms the state-of-the-art approaches by up to 28 times in terms of running time, and by up to 8 times in terms of main memory usage. As a by-product, we also improve the space complexity of computing all k-ECCs for a specific k to 2m + O(n).
Parallel Training of Knowledge Graph Embedding Models: A Comparison of Techniques [Download Paper] [Experiments, Analyses & Benchmarks] Adrian Kochsiek (University of Mannheim)*, Rainer Gemulla (Universität Mannheim) Knowledge graph embedding (KGE) models represent the entities and relations of a knowledge graph (KG) using dense continuous representations called embeddings. KGE methods have recently gained traction for tasks such as knowledge graph completion and reasoning as well as to provide suitable entity representations for downstream learning tasks. While a large part of the available literature focuses on small KGs, a number of frameworks that are able to train KGE models for large-scale KGs by parallelization across multiple GPUs or machines have recently been proposed. So far, the benefits and drawbacks of the various parallelization techniques have not been studied comprehensively. In this paper, we report on an experimental study in which we present, re-implement in a common computational framework, investigate, and improve the available techniques. We found that the evaluation methodologies used in prior work are often not comparable and can be misleading, and that most of the currently implemented training methods tend to have a negative impact on embedding quality. We propose a simple but effective variation of the stratification technique used by PyTorch BigGraph for mitigation. Moreover, basic random partitioning can be an effective or even the best-performing choice when combined with suitable sampling techniques. Ultimately, we found that efficient and effective parallel training of large-scale KGE models is indeed achievable but requires a careful choice of techniques.