Text-to-SQL Benchmarks are Broken: An In-Depth Analysis of Annotation Errors
Abstract
Text-to-SQL has been widely studied in both academia and industry. Researchers have developed a series of benchmarks to evaluate different techniques and provide insights for further improvement. However, existing text-to-SQL datasets contain substantial annotation errors, ranging from incorrect ground-truth SQL queries to ambiguous questions, which compromise the reliability of evaluation results. In this work, we present a comprehensive analysis of two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and find error rates of 52.8% and 66.1%, respectively. By re-evaluating five leading open-source methods from the BIRD leaderboard on our corrected benchmark, we observe relative performance changes ranging from −3% to 31%. These changes produce notable shifts in the methods' performance ranking, with moves of up to three positions. The significant changes in performance and ranking highlight the unreliability of current text-to-SQL benchmarks. We advocate for the development of higher-quality text-to-SQL benchmarks and more effective annotation pipelines.