Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies

Published 27 May 2025 in cs.CL, cs.AI, and cs.LG | (2505.23804v2)

Abstract: While LLMs achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct. Our work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. In particular, we show that Platt scaling, a canonical method for calibration, provides substantial improvements over directly using raw model output probabilities as confidence scores. Furthermore, we propose a method for text-to-SQL calibration that leverages the structured nature of SQL queries to provide more granular signals of correctness, named "sub-clause frequency" (SCF) scores. Using multivariate Platt scaling (MPS), our extension of the canonical Platt scaling technique, we combine individual SCF scores into an overall accurate and calibrated score. Empirical evaluation on two popular text-to-SQL datasets shows that our approach of combining MPS and SCF yields further improvements in calibration and the related task of error detection over traditional Platt scaling.


Summary

The paper introduces a calibration approach that leverages sub-clause frequencies with multivariate Platt scaling to improve LLM uncertainty estimation for SQL queries.
Experiments on the SPIDER and BIRD datasets show improvements in calibration metrics (Brier score, ECE) and in error-detection AUC compared to traditional Platt scaling.
The method effectively detects errors in structured SQL queries, thereby increasing transparency and reliability in text-to-SQL systems.

Introduction

The paper "Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies" (2505.23804) addresses the challenge of providing calibrated confidence scores for the correctness of SQL queries produced by LLMs. Text-to-SQL parsing converts natural language questions into structured SQL queries, which lets users access large structured databases without expertise in SQL syntax. Although LLMs perform strongly on this task, they sometimes produce outputs that are confidently incorrect. Calibrated confidence measures are therefore crucial for deploying trustworthy text-to-SQL systems.

The paper establishes a benchmark for post-hoc calibration methods applied to LLM-based text-to-SQL parsing. It first shows that Platt scaling delivers notable calibration improvements over raw model probabilities. The authors then introduce a calibration method that combines sub-clause frequency (SCF) scores with a multivariate Platt scaling (MPS) framework. This approach exploits the structured nature of SQL: the frequency of each individual sub-clause provides a more granular signal of correctness than a single query-level score. Experiments on the SPIDER and BIRD datasets show that this method yields further improvements in calibration and error detection over traditional Platt scaling.

Figure 1: An example question from the SPIDER dataset and an output produced by T5 3B. Correct and incorrect sub-clauses are indicated, with corresponding sub-clause frequency (SCF) scores.

Calibration Techniques

Calibration is defined as the alignment between reported probabilities and observed correctness outcomes. The paper adopts standard statistical definitions and methods adapted to the text-to-SQL domain. The goal is to calibrate a scoring function $s$, mapping SQL queries to confidence scores reflecting the likelihood of correctness, using post-hoc methods like Platt scaling.
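As a sketch of the standard definition (the symbols $y$ for an output query and $p$ for a confidence level are notation assumed here for illustration, not taken from the paper), a scoring function $s$ is perfectly calibrated when, among outputs assigned confidence $p$, a fraction $p$ are correct:

```latex
\Pr\big(y \text{ is correct} \mid s(y) = p\big) = p \quad \text{for all } p \in [0, 1]
```

For example, under this definition, of all queries that receive a confidence of 0.8, about 80% should be correct.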

Platt scaling fits a logistic regression on top of the model's scores so that the resulting probabilities are better calibrated. A key limitation of traditional Platt scaling is that it is univariate: it rescales a single confidence score. The paper's multivariate Platt scaling extends the method to multiple input signals derived from SQL structure, specifically the frequency of sub-clauses across likely outputs. This multivariate approach aims to produce well-calibrated probabilities that communicate uncertainty effectively.
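The difference between the univariate and multivariate cases can be illustrated with a minimal sketch. It assumes the calibrator is a plain logistic regression fit by batch gradient descent on a held-out calibration set; the function names, hyperparameters, toy data, and the choice of features in the multivariate example are all hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_platt(features, labels, lr=0.5, epochs=2000):
    """Fit weights w and bias b by gradient descent on the mean log loss.

    features: list of feature vectors (length 1 for univariate Platt
              scaling, longer for the multivariate variant).
    labels:   list of 0/1 correctness labels from a calibration set.
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def calibrated_confidence(x, w, b):
    # Map a feature vector to a probability of correctness.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy calibration data (made up for illustration).
labels = [0, 0, 1, 1]

# Univariate: one raw model score per query.
uni_features = [[0.1], [0.2], [0.8], [0.9]]
uni_w, uni_b = fit_platt(uni_features, labels)

# Multivariate: raw score plus two hypothetical sub-clause signals.
multi_features = [[0.1, 0.2, 0.3], [0.2, 0.4, 0.1], [0.8, 0.9, 0.7], [0.9, 1.0, 0.9]]
multi_w, multi_b = fit_platt(multi_features, labels)

print(calibrated_confidence([0.9], uni_w, uni_b))
print(calibrated_confidence([0.9, 1.0, 0.9], multi_w, multi_b))
```

The only change between the two cases is the length of the feature vector, which is why the multivariate variant can absorb additional signals without changing the calibration procedure itself.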

Methods and Approaches

The paper's methodological framework is centered on MPS and SCF scores. MPS extends Platt scaling by integrating signals from multiple sub-clause frequencies into a single confidence estimate. For each SQL output, SCF scores are computed by sampling alternative outputs with nucleus sampling and beam search, then counting how often each sub-clause of the output appears across the samples. These scores are fed into a logistic regression calibration model, which learns how sub-clause consistency relates to correctness.
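The frequency-counting step can be sketched as follows. This is a simplified illustration, not the paper's implementation: it splits flat queries on a fixed keyword list with a regular expression, whereas the paper parses queries into tree structures. The keyword list, the normalization (which also lowercases string literals), and the example queries are assumptions, and nested subqueries are not handled.

```python
import re
from collections import Counter

# Hypothetical clause keywords for a flat split of non-nested queries.
CLAUSE_PATTERN = re.compile(
    r"\b(SELECT|FROM|WHERE|GROUP\s+BY|HAVING|ORDER\s+BY|LIMIT)\b",
    re.IGNORECASE,
)

def split_sub_clauses(query):
    """Split a SQL string into normalized (keyword, body) sub-clauses."""
    parts = CLAUSE_PATTERN.split(query.strip().rstrip(";"))
    # With a capturing group, parts = [prefix, kw1, body1, kw2, body2, ...].
    clauses = []
    for kw, body in zip(parts[1::2], parts[2::2]):
        keyword = " ".join(kw.upper().split())
        clauses.append((keyword, " ".join(body.lower().split())))
    return clauses

def scf_scores(output, samples):
    """For each sub-clause of `output`, the fraction of samples containing it."""
    counts = Counter()
    for sample in samples:
        # Use a set so a sub-clause counts at most once per sample.
        counts.update(set(split_sub_clauses(sample)))
    return {c: counts[c] / len(samples) for c in split_sub_clauses(output)}

# Made-up example: one output query and four sampled alternatives.
output = "SELECT name FROM singer WHERE age > 30"
samples = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT name FROM singer WHERE age >= 30",
    "select name from singer where age > 30",
    "SELECT name FROM singer ORDER BY age",
]
scores = scf_scores(output, samples)
for clause, score in scores.items():
    print(clause, score)
```

In this toy case the SELECT and FROM clauses appear in every sample, while the WHERE clause appears in only half of them, so a low score singles out the sub-clause the samples disagree on.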

Figure 2: Parsing a query to derive SCF scores, illustrating the frequency counting process across sampled outputs.

Figure 3: SQL queries parsed into tree structures, highlighting sub-clause computation for SCF signals.

Experimental Evaluation

The paper evaluates calibration efficacy on the SPIDER and BIRD datasets, established benchmarks for text-to-SQL parsing. Experiments compare raw LLM outputs, traditional Platt scaling, and the proposed MPS + SCF approach across multiple metrics: Brier score, expected calibration error (ECE), adaptive calibration error (ACE), and area under the curve (AUC) for error detection.
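Three of these metrics can be sketched in a few lines. The version below assumes equal-width bins for ECE and treats AUC as the ROC AUC of separating correct from incorrect outputs by confidence; both are common conventions rather than details confirmed by this summary, and the function names and toy values are hypothetical. ACE, which uses adaptive bins, is omitted.

```python
def brier_score(conf, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / len(conf) * abs(avg_conf - accuracy)
    return ece

def auc(conf, correct):
    """Probability that a random correct output outscores a random incorrect one.

    Ties count as half. Requires at least one correct and one incorrect output.
    """
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up confidences and correctness labels.
conf = [0.95, 0.85, 0.35, 0.25]
correct = [1, 1, 0, 0]
print(brier_score(conf, correct))
print(expected_calibration_error(conf, correct))
print(auc(conf, correct))
```

Lower is better for the Brier score and ECE, while higher is better for AUC. A confidence score can rank outputs perfectly (high AUC) and still be poorly calibrated, which is why calibration and error detection are reported separately.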

Results show that MPS with SCF scores outperforms both uncalibrated and Platt-scaled probabilities. It consistently reduces calibration error and increases AUC, indicating better error detection. The method improves calibration across varied LLM architectures and query complexities, underscoring the value of integrating structured SQL signals into calibration.

Figure 4: Calibration curves comparing Platt scaling and multivariate Platt scaling using fixed-width ECE-style bins.

Conclusion

This work presents an approach to calibrating confidence scores in LLM-based text-to-SQL parsing. By leveraging the structured signals inherent in SQL syntax, the proposed MPS + SCF method improves calibration and error detection over traditional Platt scaling. The findings are relevant to improving transparency and reliability in semantic parsing tasks. Future work may extend this calibration methodology to other domains with structured outputs, further refining uncertainty quantification in AI systems.
