SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Published 23 Jun 2025 in cs.DB and cs.AI | (2506.18951v2)

Abstract: Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current LLMs, while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity, with the leading reasoning model O3-Mini achieving only 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities for SQL issue debugging. This environment leverages SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods do not explore substantial supervisory signals. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available: https://bird-critic.github.io/

Summary

  • The paper introduces BIRD-CRITIC, a SQL issue debugging benchmark (530 PostgreSQL tasks and 570 multi-dialect tasks), and Six-Gym, a training environment for open-source SQL debugging models.
  • SQL-Rewind generates executable issue-solution datasets by reverse-engineering issues from verified SQL, and f-Plan Boosting extracts high-level debugging plans so that teacher LLMs produce 73.7% more successful training trajectories.
  • Bird-Fixer, an open-source SQL debugging agent built on Qwen-2.5-Coder-14B, reaches 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing proprietary models such as Claude-3.7-Sonnet and GPT-4.1.

Introduction

The paper addresses the challenge of resolving complex SQL issues in real-world applications using LLMs. These models have shown success in text-to-SQL translation but have not been rigorously evaluated on the harder task of debugging SQL issues. To close this gap, the authors introduce BIRD-CRITIC, a benchmark of PostgreSQL and multi-dialect SQL tasks derived from authentic user issues. The paper further proposes Six-Gym, an environment for training open-source models to debug SQL, alongside a new method called f-Plan Boosting.

BIRD-CRITIC Benchmark

BIRD-CRITIC is designed to analyze LLM capabilities in SQL issue resolution. It comprises 530 PostgreSQL-specific tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi). Each task is grounded in a real-world issue recreated in a controlled environment to eliminate data contamination. Baseline results show how hard the task is: the best-performing reasoning model, O3-Mini, reaches only a 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi.

Figure 1: Illustration of the SQL issue debugging process in BIRD-CRITIC.

Six-Gym: Automated SQL Debugging Environment

The Six-Gym (Sql-fIX-Gym) environment is built to train SQL issue debugging models. It employs the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL. This makes it possible to produce training data at scale without manual annotation. f-Plan Boosting complements it by extracting high-level debugging plans from SQL solutions, which lets teacher LLMs produce 73.7% more successful trajectories for training.

Figure 2: Example task structure within the BIRD-CRITIC benchmark.
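
The reverse-engineering idea behind SQL-Rewind can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the paper targets PostgreSQL and other dialects, while this example uses Python's built-in `sqlite3` so it runs without setup; the `rewind` helper, the string-replacement mutations, and the toy `orders` table are hypothetical and are not the authors' pipeline. The sketch only shows the core check, namely that a candidate issue is kept when it is executable and behaves differently from the verified SQL.

```python
import sqlite3


def run(conn, sql):
    """Execute sql and return ('ok', sorted rows) or ('error', message)."""
    try:
        rows = conn.execute(sql).fetchall()
        return ("ok", sorted(rows, key=repr))
    except sqlite3.Error as exc:
        return ("error", str(exc))


def rewind(conn, verified_sql, mutations):
    """Derive issue-solution pairs from one verified query.

    Each mutation is an (old, new) text substitution (a hypothetical stand-in
    for however issues are actually synthesized). A pair is kept only when the
    mutated query behaves differently from the verified one.
    """
    expected = run(conn, verified_sql)
    if expected[0] != "ok":
        raise ValueError("verified SQL must execute successfully")
    pairs = []
    for old, new in mutations:
        if old not in verified_sql:
            continue
        issue_sql = verified_sql.replace(old, new, 1)
        if run(conn, issue_sql) != expected:
            pairs.append({"issue_sql": issue_sql, "solution_sql": verified_sql})
    return pairs


# Toy database (assumed for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'ann', 5.0), (3, 'bob', 7.0);
    """
)

verified = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
mutations = [
    ("SUM(amount)", "COUNT(amount)"),           # wrong aggregate: different rows
    ("GROUP BY customer", "GROUP BY custmer"),  # misspelled column: runtime error
    ("FROM orders", "FROM orders "),            # whitespace only: filtered out
]

pairs = rewind(conn, verified, mutations)
for pair in pairs:
    print(pair["issue_sql"])
```

Filtering on observed behavior rather than on text difference is the design point: a mutation that leaves the result unchanged is not an issue worth training on, so it is dropped.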

Bird-Fixer: An Open-Source SQL Debugger

Bird-Fixer is an open-source agent built on Qwen-2.5-Coder-14B and the SQL-act agent scaffold. Unlike traditional tool-based agents, SQL-act treats SQL commands themselves as actions, which allows richer debugging strategies. Trained with the Six-Gym environment and f-Plan Boosting, Bird-Fixer surpasses leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1 on SQL debugging tasks.
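
To make the "SQL commands as actions" idea concrete, here is a minimal sketch of such an interaction loop. It is an illustration under stated assumptions, not the SQL-act implementation: the `sql_act` function, the `scripted_policy` stand-in for the model, and the toy table are hypothetical, and `sqlite3` replaces the PostgreSQL environments used in the paper. Each turn, the policy emits one SQL statement, the environment executes it, and the rows or error message come back as the observation.

```python
import sqlite3


def sql_act(conn, policy, max_turns=5):
    """Run an action-observation loop where every action is a SQL statement.

    policy(history) returns the next SQL string, or None to stop.
    Returns the trajectory as a list of (action, observation) tuples.
    """
    history = []
    for _ in range(max_turns):
        action = policy(history)
        if action is None:
            break
        try:
            observation = conn.execute(action).fetchall()
        except sqlite3.Error as exc:
            # Errors are fed back as observations instead of ending the episode.
            observation = f"ERROR: {exc}"
        history.append((action, observation))
    return history


def scripted_policy(history):
    """Hypothetical stand-in for an LLM: replay a fixed debugging script."""
    script = [
        "SELECT custmer FROM orders",               # reproduce the user's failing query
        "PRAGMA table_info(orders)",                # inspect the schema
        "SELECT customer FROM orders ORDER BY id",  # propose the fix
    ]
    return script[len(history)] if len(history) < len(script) else None


# Toy database (assumed for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'bob');
    """
)

trajectory = sql_act(conn, scripted_policy)
for action, observation in trajectory:
    print(action, "->", observation)
```

Because exploration, schema inspection, and the final fix all use the same action type, the agent needs no separate tool vocabulary; in a real setting the scripted policy would be replaced by a model call conditioned on the history.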

Implementation Strategies

The paper contrasts popular trajectory-based fine-tuning, which the authors argue leaves substantial supervisory signal unexplored, with f-Plan Boosting, which extracts high-level debugging plans from SQL solutions to guide teacher LLMs toward more successful trajectories. Releasing Bird-Fixer as an open-source agent is intended to make sophisticated SQL debugging available for local development while safeguarding data privacy.

Figure 3: Distribution of issue categories across all of BIRD-CRITIC, indicating real-world SQL application usage.

Experimental Results

Experiments show that Bird-Fixer achieves a 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing proprietary models such as Claude-3.7-Sonnet and GPT-4.1, with significant improvements attributed to f-Plan Boosting. These results position Bird-Fixer as a viable alternative for SQL debugging, particularly in environments where data privacy favors locally hosted models.

Figure 4: LLM agent performance for BIRD-CRITIC-PG, comparing SQL action approaches.
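
The success rates reported in the abstract can be turned into approximate task counts, which makes the gaps easier to read. The sketch below is simple arithmetic on the published percentages and task totals; the solved-task counts it prints are derived by rounding and are not figures reported by the paper, and the helper name `implied_solved` is hypothetical.

```python
# Task totals and success rates as stated in the abstract.
TASKS = {"BIRD-CRITIC-PG": 530, "BIRD-CRITIC-Multi": 570}
REPORTED = {
    "O3-Mini": {"BIRD-CRITIC-PG": 38.87, "BIRD-CRITIC-Multi": 33.33},
    "Bird-Fixer": {"BIRD-CRITIC-PG": 38.11, "BIRD-CRITIC-Multi": 29.65},
}


def implied_solved(rate_percent, n_tasks):
    """Number of solved tasks implied by a percentage (derived, not reported)."""
    return round(rate_percent / 100 * n_tasks)


counts = {
    model: {split: implied_solved(rate, TASKS[split]) for split, rate in rates.items()}
    for model, rates in REPORTED.items()
}

for model, per_split in counts.items():
    for split, solved in per_split.items():
        print(f"{model:10s} {split:18s} ~{solved}/{TASKS[split]}")

# Difference between O3-Mini and Bird-Fixer in implied solved tasks.
gap = {s: counts["O3-Mini"][s] - counts["Bird-Fixer"][s] for s in TASKS}
print(gap)
```

Under this rounding, the 14B open-source agent trails O3-Mini by about 4 of 530 tasks on BIRD-CRITIC-PG and by about 21 of 570 tasks on BIRD-CRITIC-Multi.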

Conclusion

The paper introduces critical innovations in SQL issue debugging, showcasing the potential of open-source LLMs enhanced by specialized training environments and methodologies. Bird-Fixer, supported by BIRD-CRITIC benchmarking, highlights significant strides toward accessible SQL debugging tools. Future work might explore multi-turn interactions and workflow-integrated debugging to further align with real-world applications.
