Semantic Operators and Their Optimization: Towards AI-Based Data Analytics with Accuracy Guarantees

Authors:

Liana Patel, Siddharth Jha, Melissa Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, Matei Zaharia

Abstract

The semantic capabilities of large language models (LLMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems either empirically optimize expensive LLM-powered operations with no performance guarantees, or limit their support to simple batched-inference primitives. We introduce semantic operators, the first formalism with statistical accuracy guarantees for general-purpose AI-based operations with natural language parameters (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator can be implemented by multiple AI algorithms, which compose individual model invocations to orchestrate the model over the data. Our programming model specifies the expected behavior of each operator with a high-quality reference algorithm, and we develop an optimization framework that reduces cost, while providing accuracy guarantees for individual operators. Using this approach, we propose several novel optimizations to accelerate semantic filtering, joining, group-by and top-k operations by up to 1,000×. We implement semantic operators in the LOTUS system and demonstrate LOTUS’ effectiveness on real, bulk-semantic processing applications, including fact-checking, biomedical multilabel classification, search, and topic analysis. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that match or exceed quality of recent LLM-based analytic systems by up to 170%, while offering accuracy guarantees. Overall, LOTUS programs match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to 3.6× faster than the highest-quality baselines. LOTUS is publicly available at https://github.com/lotus-data/lotus.

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 11
