go back
go back
Volume 18, No. 9
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
Abstract
Analyzing unstructured data has been a persistent challenge in data processing. Recent proposals o!er declarative frameworks for LLMpowered processing of unstructured data, but they typically execute user-specified operations as-is in a single LLM call—focusing on cost rather than accuracy. This is problematic for complex tasks, where even well-prompted LLMs can miss relevant information. For instance, reliably extracting all instances of a specific clause from legal documents often requires decomposing the task, the data, or both. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives ), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of LLM execution. Across four real-world document processing tasks, DocETL improves accuracy by 21–80% over strong baselines. DocETL is open-source at docetl.org and, as of March 2025, has over 1.7k GitHub stars across diverse domains.
PVLDB is part of the VLDB Endowment Inc.
Privacy Policy