BigVectorBench: Heterogeneous Data Embedding and Compound Queries are Essential in Evaluating Vector Databases

Authors:

Guoxin Kang, Zhongxin Ge, Jingpei Hu, Xueya Zhang, Lei Wang, Jianfeng Zhan

Abstract

Vector databases are designed to effectively store, organize, and retrieve high-dimensional vectors, enabling faster and more accurate querying and analysis. This study highlights that the performance of cutting-edge vector databases hinges on their proficiency in managing heterogeneous data embedding and handling compound queries. The former task revolves around converting varied data types into a cohesive vector format, while the latter involves processing multimodal or single-modal queries with precise constraints. The paper advocates for evaluating these dual tasks within an integrated benchmark framework. However, state-of-the-art vector database benchmarks overlook heterogeneous data embedding and compound queries, creating a gap in evaluating vector database performance. To address this gap, we introduce BigVectorBench, a benchmark suite designed to evaluate vector database performance. BigVectorBench contributes by defining and evaluating the embedding performance of heterogeneous data. Additionally, it abstracts compound queries, which are increasingly used in real-world applications, replacing unimodal vector searches. Our rigorous evaluations validate the two design decisions of BigVectorBench and identify performance bottlenecks of mainstream vector databases. Its source code and user manual are available from https://github.com/BenchCouncil/BigVectorBench.

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 5
