SSEV: Self-Refinement & Ensemble Voting for Text-to-SQL
- The paper introduces a novel mechanism that combines single-agent self-refinement with ensemble voting to enhance Text-to-SQL robustness.
- It iteratively improves SQL queries using execution feedback, correcting syntactic and semantic errors until a valid output is achieved.
- Empirical evaluations demonstrate that SSEV achieves SOTA or near-SOTA results on benchmarks like Spider and BIRD while ensuring scalable inference.
Single-Agent Self-Refinement with Ensemble Voting (SSEV) is an inference-time methodology for improving robustness and execution accuracy in Text-to-SQL agents. The approach integrates execution-guided iterative refinement and an ensemble voting mechanism, supporting scalable, production-grade natural language semantic parsing over structured databases. SSEV has been embedded in leading frameworks including ReFoRCE (Deng et al., 2 Feb 2025) and PET-SQL derivatives (Yang et al., 25 Jan 2026), and achieves SOTA or near-SOTA results on benchmarks such as Spider 2.0-Lite and BIRD.
1. Conceptual Foundation
SSEV is motivated by the high variance and instability often observed in single-pass LLM-based SQL generation under real-world constraints: large database schemas, SQL dialect variation, and ambiguous user queries. The methodology is founded on two principles:
- Single-agent self-refinement: Iteratively repair and revise candidate SQL queries in response to execution feedback (both syntactic and semantic), mediated by LLM prompting. This loop continues until a valid, consistent output is obtained or a maximal number of refinements is reached.
- Ensemble voting: Parallel execution of multiple independent self-refinement threads (or, in multi-expert extensions, candidate LLMs) whose outputs are aggregated via statistically rigorous voting schemes (majority, weighted majority, randomized weighted majority).
This hybridization addresses two critical failure modes: brittle one-shot prediction and the lack of reliable agent self-critique in the absence of gold-standard data at inference time (Deng et al., 2 Feb 2025, Yang et al., 25 Jan 2026).
2. Detailed Methodology
SSEV comprises two mutually reinforcing modules: self-refinement and ensemble voting.
2.1 Self-Refinement Module
At each iteration $t$, the agent maintains a pair $(q_t, r_t)$, where $q_t$ is a SQL candidate and $r_t$ its execution result. If $q_t$ is invalid due to a syntax, type, or semantic error, the LLM receives an error-augmented prompt and is tasked to produce a corrected $q_{t+1}$ addressing the failure causes:

$$q_{t+1} = \mathrm{LLM}\big(P_{\text{err}}(q_t, r_t, \mathcal{S}),\ F\big),$$

where $P_{\text{err}}$ encodes the error message and schema context $\mathcal{S}$, and $F$ enforces the required output form. This procedure is iterated for up to $T_{\max}$ steps. Termination occurs when self-consistency is detected—i.e., a non-error, non-empty $r_t$ is observed two or more times—or by reaching $T_{\max}$ without convergence (Deng et al., 2 Feb 2025).
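The refinement loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm_fix` and `execute` are hypothetical callables standing in for the error-augmented LLM prompt and the database executor, and the two-sighting rule mirrors the self-consistency termination criterion described above.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

def self_refine(
    question: str,
    llm_fix: Callable[[str, str, str], str],      # (question, sql, error) -> repaired sql
    execute: Callable[[str], Tuple[Optional[list], Optional[str]]],  # sql -> (rows, error)
    initial_sql: str,
    max_iters: int = 5,                           # T_max: refinement budget
) -> Tuple[str, Optional[list]]:
    """Iterate error-guided repair until a non-empty result repeats or the budget runs out."""
    seen: Counter = Counter()
    sql, best = initial_sql, (initial_sql, None)
    for _ in range(max_iters):
        rows, error = execute(sql)
        if error is None and rows:
            key = repr(rows)                      # hashable fingerprint of the result set
            seen[key] += 1
            best = (sql, rows)
            if seen[key] >= 2:                    # self-consistency: same result seen twice
                return sql, rows
            sql = llm_fix(question, sql, "VALID_BUT_UNCONFIRMED")
        else:
            # error-augmented prompt: ask the LLM to repair the failing query
            sql = llm_fix(question, sql, error or "EMPTY_RESULT")
    return best                                   # best-effort answer after T_max steps
```

In a real pipeline `llm_fix` would render the schema context and error message into the prompt template, and `execute` would wrap the target database driver.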
2.2 Ensemble Voting Mechanism
Multiple self-refinement threads (typically up to $W = 8$), each initialized with identical inputs but stochastic LLM behavior, are executed in parallel. Upon completion:
- Each thread outputs its final pair $(q^{(i)}, r^{(i)})$, where $r^{(i)}$ is the execution result or $\varnothing$ upon failure.
- All results are tallied and the consensus answer is chosen as $r^{\star} = \arg\max_{r} |\{\, i : r^{(i)} = r \,\}|$.

A strict majority threshold is required to issue $r^{\star}$; otherwise, ambiguity is declared and the example is deferred (Deng et al., 2 Feb 2025). In multi-expert setups, deterministic or randomized Weighted Majority Algorithms (WMA/RWMA) maintain expert weights $w_j$, aggregate per-candidate votes $V(r) = \sum_j w_j \,\mathbf{1}[r_j = r]$, and select $\hat{r} = \arg\max_{r} V(r)$, enabling adaptive expert specialization without ground-truth supervision (Yang et al., 25 Jan 2026).
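A minimal sketch of the majority-vote-with-deferral step, assuming each thread reports its execution result as a hashable value (or `None` on failure); the 0.5 threshold encodes the strict-majority requirement:

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple

def majority_vote(
    results: List[Optional[Hashable]],
    threshold: float = 0.5,
) -> Tuple[Optional[Hashable], str]:
    """Pick the most common non-failed result; defer when no strict majority exists."""
    valid = [r for r in results if r is not None]   # None marks a failed thread
    if valid:
        winner, count = Counter(valid).most_common(1)[0]
        if count / len(results) > threshold:        # strict majority over ALL threads
            return winner, "Consensus"
    return None, "Ambiguous"                        # defer the example
```

Note that failed threads still count in the denominator, so widespread execution failure also triggers deferral.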
3. Mathematical Formulation and Execution
The key mathematical operations in SSEV are as follows:
| Symbol/Step | Role | Source |
|---|---|---|
| $q_{t+1} = \mathrm{LLM}(P_{\text{err}}(q_t, r_t, \mathcal{S}), F)$ | Iterative error repair | (Deng et al., 2 Feb 2025) |
| $V(r) = \sum_j w_j \,\mathbf{1}[r_j = r]$ | Candidate weight tally | (Yang et al., 25 Jan 2026) |
| $w_j \leftarrow w_j (1-\eta)^{\mathbf{1}[r_j \neq \hat{r}]}$ | Online weight update | (Yang et al., 25 Jan 2026) |
| $M \le 2(1+\eta)\, m^{*} + \frac{2 \ln n}{\eta}$ | WMA mistake bound | (Yang et al., 25 Jan 2026) |
| $\mathbb{E}[M] \le (1+\eta)\, m^{*} + \frac{\ln n}{\eta}$ | RWMA expected bound | (Yang et al., 25 Jan 2026) |
| Fraction of the $W$ threads producing $r^{\star}$ | Answer confidence metric | (Deng et al., 2 Feb 2025) |
Here $n$ is the number of experts, $m^{*}$ the mistake count of the best single expert, and $\eta$ the multiplicative penalty rate.
Self-refinement in SSEV is not based on loss backpropagation but on minimizing execution failure rates via error-guided LLM feedback. Ensemble voting employs online-learning-style regret minimization to approximate oracle selection across multiple candidate generators without the need for reference SQL.
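The weighted vote and multiplicative update can be illustrated as follows. Since no reference SQL is available at inference time, this sketch adopts a pseudo-label convention, penalizing experts that disagreed with the consensus answer; treating the consensus as the update signal is an assumption for illustration, not a detail confirmed by the source.

```python
from typing import Dict, Hashable, List

def weighted_vote(weights: List[float], candidates: List[Hashable]) -> Hashable:
    """Aggregate vote mass V(r) = sum_j w_j * 1[r_j = r] and return the argmax."""
    votes: Dict[Hashable, float] = {}
    for w, r in zip(weights, candidates):
        votes[r] = votes.get(r, 0.0) + w
    return max(votes, key=lambda r: votes[r])

def wma_update(weights: List[float], candidates: List[Hashable],
               adopted: Hashable, eta: float = 0.5) -> List[float]:
    """Multiplicatively penalize experts whose candidate disagreed with the adopted answer."""
    return [w * (1.0 - eta) if r != adopted else w
            for w, r in zip(weights, candidates)]
```

Over repeated queries, experts that persistently disagree with the consensus lose weight geometrically, which is what yields the WMA/RWMA regret bounds relative to the best single expert.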
4. Prompt Engineering and Schema Linking
The efficacy of SSEV is critically dependent on rigorous prompt construction and schema representation:
- PreSQL prompts: Present the full database DDL, sample values, and relevant task instructions (e.g., “Minimize SQL execution time”), often with few-shot demonstrations selected by embedding similarity.
- PostSQL prompts: Restrict schema context to only the columns/tables linked in the PreSQL phase.
- Self-refinement prompts: Supply the original question, preceding SQL, execution error message/type, correction strategies (e.g., simplify clause, adjust join), and strict output format constraints (e.g., “Action: BIGQUERY_EXEC_SQL(sql_query=...)”).
- Schema linking: Use regex-based DDL parsing or explicit mentions to minimize prompt length and focus the LLM on relevant schema fragments (Yang et al., 25 Jan 2026).
For large or complex schemas, schema compression and context pruning are necessary to mitigate LLM context window limitations, a feature implemented in ReFoRCE (Deng et al., 2 Feb 2025).
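A toy illustration of regex-based DDL parsing and mention-based pruning. Real schemas (quoted identifiers, types with embedded commas such as `DECIMAL(10,2)`, multi-statement files) would need a proper SQL parser, so this is only a sketch of the idea:

```python
import re
from typing import Dict, List

_CONSTRAINTS = {"PRIMARY", "FOREIGN", "UNIQUE", "CONSTRAINT", "CHECK"}

def extract_tables(ddl: str) -> Dict[str, List[str]]:
    """Parse CREATE TABLE statements into {table: [column, ...]} via regex."""
    schema: Dict[str, List[str]] = {}
    for m in re.finditer(r"CREATE TABLE\s+(\w+)\s*\((.*?)\);", ddl, re.S | re.I):
        table, body = m.group(1), m.group(2)
        cols = []
        for entry in body.split(","):          # naive: breaks on types like DECIMAL(10,2)
            tokens = entry.strip().split()
            if tokens and tokens[0].upper() not in _CONSTRAINTS:
                cols.append(tokens[0])
        schema[table] = cols
    return schema

def prune_schema(schema: Dict[str, List[str]], question: str) -> Dict[str, List[str]]:
    """Keep only tables whose name or columns are mentioned in the question."""
    q = question.lower()
    pruned: Dict[str, List[str]] = {}
    for table, cols in schema.items():
        hits = [c for c in cols if c.lower() in q]
        if table.lower() in q or hits:
            pruned[table] = hits or cols       # keep all columns if only the table matched
    return pruned
```

The pruned dictionary is what would be serialized back into the PostSQL prompt, shrinking context length on wide schemas.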
5. Empirical Evaluation and Ablation Studies
SSEV exhibits measurable gains on several datasets and under multiple LLM backbones. Notably:
| Dataset | SSEV (WMA) EX | Best Individual EX | Gain | Source |
|---|---|---|---|---|
| Spider 1.0-Dev | 85.5% | 84.9% (Gemini) | +0.6% | (Yang et al., 25 Jan 2026) |
| Spider 1.0-Test | 86.4% | 85.7% (Gemini) | +0.7% | (Yang et al., 25 Jan 2026) |
| BIRD-Dev | 66.3% | 65.97% | +0.33% | (Yang et al., 25 Jan 2026) |
| Spider 2.0-Snow | 26.7% (W=3) | ~20% (single run) | +6.7% | (Deng et al., 2 Feb 2025) |
Ablation studies confirm that:
- Self-refinement contributes more substantially on complex/real-world data and with less capable LLMs.
- Limiting refinement to at most $3$ rounds suffices for maximal utility; most recoverable errors are corrected early.
- Schema linking and output format enforcement further regularize agent predictions, especially in high-column-count settings (Yang et al., 25 Jan 2026).
6. Implementation and Practical Considerations
SSEV adds only a modest deployment cost over naive LLM inference, owing to parallelization and bounded refinement loops. Each thread incurs up to $T_{\max}$ LLM calls and SQL executions, and all candidates are batched for voting. Production notes include:
- LLM ensemble: heterogeneous model APIs (e.g., Llama-3-70B, Gemini-2.5, Qwen2.5) (Yang et al., 25 Jan 2026).
- Execution: Pipelines modularized in Python, LLM calls distributed, embeddings cached, and results persisted for bootstrapping.
- Confidence deferral: Results below the confidence threshold are flagged as “Ambiguous” for human or further computational review (Deng et al., 2 Feb 2025).
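Putting the pieces together, a hedged sketch of the parallel deployment pattern, with `run_thread` a hypothetical callable wrapping one full self-refinement pass and returning its final execution result (or `None` on failure):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Hashable, Optional

def run_ensemble(
    run_thread: Callable[[int], Optional[Hashable]],  # one self-refinement thread
    n_threads: int = 3,                               # W: ensemble width
    threshold: float = 0.5,                           # strict-majority cutoff
) -> Hashable:
    """Run W independent refinement threads in parallel, then vote with deferral."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(run_thread, range(n_threads)))
    valid = [r for r in results if r is not None]
    if valid:
        winner, count = Counter(valid).most_common(1)[0]
        if count / n_threads > threshold:
            return winner
    return "Ambiguous"                                # flag for human review
```

Because the threads are independent, wall-clock latency under sufficient parallelism is dominated by the slowest single refinement loop rather than the ensemble width.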
A plausible implication is that SSEV preserves inference latency under sufficient hardware parallelism and enables practical deployment on datasets with large, cross-dialect schema variance.
7. Extensions, Limitations, and Future Directions
While SSEV is robust for single-hop query construction and schema-centric Text-to-SQL, it does not natively support multi-step reasoning or leverage external world knowledge. Recent work proposes ReCAPAgent-SQL, which decomposes the refinement-critique-act-plan cycle over specialized LLM agents. Moreover, future enhancements may include:
- Integrating semantic self-critique in addition to execution feedback.
- Calibrating ensemble weights dynamically per domain or user profile.
- Addressing prompt context overflow via progressive schema partitioning (Yang et al., 25 Jan 2026).
Overall, SSEV exemplifies a class of methods coupling LLM self-improvement with statistically grounded consensus estimation, demonstrably raising the reliability ceiling of scalable, industry-strength semantic parsing agents (Deng et al., 2 Feb 2025, Yang et al., 25 Jan 2026).