Data Partitioning for In-Memory Systems: Myths, Challenges, and Opportunities
Abstract
Data partitioning is an important primitive for in-memory data processing systems, and in many cases it is the key performance bottleneck. This important primitive has been the focus of many studies in the past. However, as we argue in this paper, these previous studies have been narrow in their scope leaving many unanswered questions that are of paramount importance in practice. Consequently, to the best of our knowledge, there is no clear answer to the seemingly simple question of what is an efficient partitioning strategy for inmemory data systems. In this paper, we carefully consider this data partitioning primitive in the context of multi-core in-memory data settings. We look at past work in this area and note that many of these studies miss looking at many important aspects such as the impact of tuple size and the impact of the data formats (e.g. rowstore vs. columnar-store). We build on this initial observation and examine a number of data partitioning strategies, leading to a better understanding of how data partitioning methods perform on modern multi-core large memory systems. We note a few interesting observations, including how relatively simple methods work quite well in practice across a broad spectrum of data parameters. To help future researchers, we propose a partitioning benchmark so that work in this area can take a broader and more realistic perspective when working on data partitioning methods. Overall, the key contribution of this paper is to separate the wheat from the chaff in previous research in this area, analyze the relative performance of various methods on a broad set of data parameters, and help provide a more systematic evaluation framework for future work in this area.