ArXiv

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

Authors
James Petullo, Nianwen Xue
Categories
cs.CL, cs.AI
arXiv
https://arxiv.org/abs/2605.08057v1
PDF
https://arxiv.org/pdf/2605.08057v1

Brief

CA-SQL is a complexity-aware Text-to-SQL inference pipeline that scales exploration breadth to each task's estimated difficulty, uses prompt seeding inspired by evolutionary search to elicit diverse candidate queries, and selects the final query with a novel voting method. On the Bird-Bench (BIRD) development set, the authors report a state-of-the-art 51.72% on the "challenging" tier using only GPT-4o-mini, ahead of other in-context learning approaches, including some built on larger models, along with 61.06% overall execution accuracy and a 68.77% Soft F1 score.

Why it matters

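The hardest BIRD problems remain a weak spot for current inference-time approaches, which the authors attribute to inadequate exploration of the solution space. CA-SQL reports a state-of-the-art score of 51.72% on the "challenging" tier of the Bird-Bench (BIRD) development set using only GPT-4o-mini, outperforming in-context learning approaches that rely on larger models (reported 2026-05-08).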

Key details

  • The pipeline uses complexity-aware, difficulty-scaled exploration, evolutionary-search-inspired prompt seeding, and a novel voting selector; overall results on BIRD dev are 61.06% execution accuracy and 68.77% Soft F1.
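
The idea of scaling exploration breadth with estimated difficulty can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the summary does not say how CA-SQL estimates difficulty or maps it to a compute budget, so the keyword heuristic, the function names `estimate_difficulty` and `candidate_budget`, and the bounds of 2 and 16 candidates are all hypothetical.

```python
def estimate_difficulty(question: str, num_tables: int) -> float:
    """Hypothetical difficulty heuristic returning a score in [0, 1].

    Assumption: questions that mention aggregation-style words or that
    touch more tables are harder. This is NOT the paper's estimator.
    """
    keywords = {"average", "percentage", "ratio", "most", "least", "between"}
    words = question.lower().split()
    hits = sum(1 for word in words if word in keywords)
    score = 0.15 * hits + 0.1 * max(num_tables - 1, 0)
    return min(score, 1.0)


def candidate_budget(difficulty: float,
                     min_candidates: int = 2,
                     max_candidates: int = 16) -> int:
    """Map a difficulty score to a number of candidate queries to sample.

    Assumption: a linear interpolation between two illustrative bounds.
    """
    difficulty = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return min_candidates + round(difficulty * (max_candidates - min_candidates))


easy = estimate_difficulty("List all school names", num_tables=1)
hard = estimate_difficulty(
    "What is the average ratio of enrollment between the most and least funded districts",
    num_tables=4,
)
easy_budget = candidate_budget(easy)
hard_budget = candidate_budget(hard)
```

Under these assumptions an easy question receives the minimum budget while a harder one receives more candidates, which is the general shape of the allocation the summary describes.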

Source evidence

Abstract

While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
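
The abstract does not describe how the novel voting method works. As a point of reference only, the sketch below shows a common baseline for selecting among candidate queries: execution-based majority voting, where candidates are grouped by the result they return and the largest group wins. This is a stand-in technique and not CA-SQL's selector; the helper names `execute_signature` and `vote`, the toy `schools` table, and the candidate queries are all hypothetical.

```python
import sqlite3
from collections import Counter, defaultdict


def execute_signature(conn, sql):
    """Run a query and return an order-insensitive signature of its rows.

    Returns None when the query fails, so failing candidates get no vote.
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Multiset of rows: ignores row order but keeps duplicate counts.
    return frozenset(Counter(rows).items())


def vote(conn, candidates):
    """Return one query from the largest group of result-equivalent candidates.

    Ties go to the group whose first member appeared earliest.
    Returns None if every candidate fails to execute.
    """
    groups = defaultdict(list)
    for sql in candidates:
        signature = execute_signature(conn, sql)
        if signature is not None:
            groups[signature].append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]


# Toy database standing in for a BIRD schema (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("A", "Boston"), ("B", "Boston"), ("C", "Waltham")],
)

candidates = [
    "SELECT COUNT(*) FROM schools WHERE city = 'Boston'",
    "SELECT COUNT(name) FROM schools WHERE city = 'Boston'",
    "SELECT COUNT(*) FROM schools",
    "SELECT COUNT(*) FROM school",  # fails: no such table
]
chosen = vote(conn, candidates)
```

Grouping by execution result rather than by query text lets syntactically different but equivalent queries pool their votes, which is why this baseline is often used when many diverse candidates are sampled.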