UnsolvableQA Paradigm for LLM Calibration
- UnsolvableQA is a framework that trains LLMs to distinguish between solvable, unsolvable, and beyond-capacity problems using paired benchmarks.
- The framework leverages programmatic generation and reverse construction to create datasets with explicit solvability labels for rigorous evaluation.
- Reinforcement learning with dynamic refusal thresholds and group-relative policy optimization is used to improve accuracy and prevent capability collapse.
The UnsolvableQA paradigm refers to a class of benchmarks and learning objectives designed to train and evaluate LLMs on their ability to solve problems, detect objectively unsolvable instances (i.e., those with internal contradictions), and calibrate their refusal behavior on instances that are solvable in principle but beyond their current capability. The paradigm is operationalized via the UnsolvableQA dataset and the UnsolvableRL framework, which jointly drive LLM alignment on solvability detection and prudent refusal (Peng et al., 1 Dec 2025).
1. Motivation and Formalization
Traditional LLM evaluation focuses on instance accuracy, but LLMs frequently hallucinate confident answers to unsolvable or contradictory tasks, leading to reliability failures. The UnsolvableQA paradigm addresses this limitation by explicitly pairing solvable and unsolvable problems and constructing multi-faceted objectives that encourage correct problem-solving, high-precision rejection on inherently unsolvable instances, and careful refusal on exceptionally difficult but technically solvable cases.
Formally, given an input $x$ drawn from a dataset $\mathcal{D}$, there are three instance categories:
- $\mathcal{S}$: objectively solvable, with a ground-truth answer set;
- $\mathcal{U}$: unsolvable, containing inherent contradictions;
- $\mathcal{C}$ (implied): solvable in principle but outside current model capacity.
The model’s policy $\pi_\theta$ must produce:
- a correct solution when $x \in \mathcal{S}$;
- a canonical “unsolvable” tag when $x \in \mathcal{U}$;
- a “beyond_capacity” tag when $x$ is solvable in principle but the model’s empirical accuracy on $x$’s cohort is low.
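The three-way target behavior can be sketched as a scoring rule. This is an illustrative sketch only; the tag strings and category names below are assumptions, not the paper's exact protocol:

```python
# Hypothetical scoring rule for the three instance categories.
# Tag strings and category names are illustrative assumptions.

def score_response(category, response, answer_set=None):
    """Return True iff the response matches the target behavior for its category."""
    if category == "solvable":          # S: must produce a correct answer
        return answer_set is not None and response in answer_set
    if category == "unsolvable":        # U: must emit the canonical unsolvable tag
        return response == "<unsolvable>"
    if category == "beyond_capacity":   # C: prudent refusal is the target
        return response == "<beyond_capacity>"
    raise ValueError(f"unknown category: {category}")

print(score_response("solvable", "24", {"24"}))      # True
print(score_response("unsolvable", "<unsolvable>"))  # True
print(score_response("unsolvable", "42"))            # False
```

Note that refusal on a $\mathcal{C}$ instance scores well even though the instance is solvable in principle, which is exactly the calibration behavior the paradigm rewards.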
2. Construction of UnsolvableQA Data
UnsolvableQA consists of paired solvable and unsolvable instances across diverse domains such as Game24, Hamiltonian paths, Hitori, Mazes, and AIME-style mathematics problems (Peng et al., 1 Dec 2025). Its construction involves two main methodologies:
- Programmatic Generation: For logic puzzles, solvable/unsolvable instances are created via constraint-based enumeration and contradiction injection.
- Reverse Construction: For mathematical domains, valid reasoning chains are perturbed to deliberately introduce contradictions, ensuring objective unsolvability.
The dataset for Qwen3 experiments includes 637 training instances (348 solvable, 289 unsolvable) and 699 test instances, carefully balanced across domains to prevent bias and to distinguish between unsolvability and incapability.
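Programmatic solvability labeling can be sketched for the Game24 domain: an instance is objectively solvable iff some arithmetic combination of its four numbers evaluates to 24, which an exhaustive search decides exactly. This is a minimal illustration; the paper's actual generators also cover contradiction injection for the other puzzle domains:

```python
# Minimal solvability oracle for Game24-style instances: exhaustively
# combine pairs of values with +, -, *, / and check whether 24 is
# reachable. Exact rational arithmetic avoids float round-off.
from itertools import permutations
from fractions import Fraction

def solvable_24(nums):
    ops = [lambda a, b: a + b, lambda a, b: a - b, lambda a, b: a * b,
           lambda a, b: a / b if b != 0 else None]
    def search(vals):
        if len(vals) == 1:
            return vals[0] == 24
        for i, j in permutations(range(len(vals)), 2):  # ordered pairs: covers - and /
            rest = [vals[k] for k in range(len(vals)) if k not in (i, j)]
            for op in ops:
                r = op(vals[i], vals[j])
                if r is not None and search(rest + [r]):
                    return True
        return False
    return search([Fraction(n) for n in nums])

print(solvable_24([4, 6, 1, 1]))   # True  (4 * 6 * 1 * 1 = 24)
print(solvable_24([1, 1, 1, 1]))   # False
```

An oracle like this yields ground-truth solvability labels for free, which is what makes the paired solvable/unsolvable construction objective rather than annotator-dependent.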
3. UnsolvableRL Framework
The UnsolvableRL framework aligns LLMs for the UnsolvableQA task by optimizing reinforcement learning objectives that integrate accuracy, unsolvability detection, and capability-calibrated refusal (Peng et al., 1 Dec 2025). The joint per-trajectory reward decomposes as

$$R = R_{\text{acc}} + R_{\text{rej}} + R_{\text{cal}},$$

where:
- $R_{\text{acc}}$ rewards correct answers on $\mathcal{S}$;
- $R_{\text{rej}}$ rewards correct unsolvable tags on $\mathcal{U}$ and imposes a penalty for false rejections on $\mathcal{S}$;
- $R_{\text{cal}}$ rewards a “beyond_capacity” tag only when the model’s empirical batch accuracy $\hat{p}$ falls below a dynamic threshold $\tau_t$, with $\tau_t$ annealed upward toward $1$ during training.
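The composite reward and the annealed threshold can be sketched in a few lines. The specific weights, tag strings, and annealing schedule below are assumptions for illustration; the paper's actual reward shaping may differ:

```python
# Illustrative composite reward R = R_acc + R_rej + R_cal for one
# trajectory. Component magnitudes (1.0, -0.5) and tag strings are
# assumed, not taken from the paper.

def trajectory_reward(category, response, correct, batch_acc, tau):
    if category == "solvable":
        if correct:
            return 1.0                   # R_acc: correct answer on S
        if response == "<unsolvable>":
            return -0.5                  # R_rej penalty: false rejection on S
        return 0.0
    if category == "unsolvable":
        return 1.0 if response == "<unsolvable>" else 0.0  # R_rej on U
    # Beyond-capacity cohort: refusal earns R_cal only when empirical
    # batch accuracy is below the dynamic threshold tau.
    if response == "<beyond_capacity>":
        return 1.0 if batch_acc < tau else 0.0
    return 1.0 if correct else 0.0

def anneal_tau(step, total_steps, tau0=0.3):
    """Linearly anneal the refusal threshold upward toward 1 (schedule assumed)."""
    return tau0 + (1.0 - tau0) * min(step / total_steps, 1.0)

print(trajectory_reward("solvable", "<unsolvable>", False, 0.0, 0.3))  # -0.5
print(anneal_tau(0, 100))  # 0.3
```

Annealing $\tau_t$ upward makes refusal progressively harder to "earn": early in training the model may refuse hard cohorts cheaply, but as $\tau_t \to 1$ refusal pays off only where accuracy genuinely stays low.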
4. Learning Algorithm: Group-Relative Policy Optimization
UnsolvableRL employs Group-Relative Policy Optimization (GRPO) (Peng et al., 1 Dec 2025), which uses within-group normalization of rewards to handle heterogeneous objectives and stabilize training across diverse instance types. For group size $G$, the approach samples $G$ outputs $\{o_1, \dots, o_G\}$ per prompt, computes trajectory rewards $\{R_1, \dots, R_G\}$, and normalizes advantages:

$$A_i = \frac{R_i - \mu_R}{\sigma_R + \epsilon},$$

with $\mu_R = \frac{1}{G}\sum_{i=1}^{G} R_i$ and $\sigma_R$ the standard deviation of rewards within the group. The surrogate objective takes the PPO-style clipped form

$$\mathcal{J}(\theta) = \mathbb{E}_i\!\left[\min\!\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon_{\text{clip}},\, 1+\epsilon_{\text{clip}})\, A_i\big)\right], \qquad \rho_i = \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\text{old}}}(o_i \mid x)}.$$

This is optimized by stochastic gradient ascent without a separate value network, facilitating direct handling of the composite reward.
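The group-relative advantage step is small enough to show directly. A minimal sketch (the $\epsilon$ value is an assumption):

```python
# Group-relative advantages as used by GRPO: rewards within a sampled
# group are normalized by the group mean and standard deviation, so no
# learned value network is needed as a baseline.
import statistics

def group_advantages(rewards, eps=1e-8):
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)   # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group: two trajectories rewarded, two not.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # approximately [1.0, -1.0, 1.0, -1.0]
```

Because normalization is per group, a reward of 1.0 on an easy prompt (where most samples succeed) yields a small advantage, while the same reward on a hard prompt yields a large one; this is what lets a single scalar reward drive heterogeneous objectives stably.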
5. Capability Collapse and Negative Supervision
A critical empirical finding is the capability collapse phenomenon: if an LLM is trained on only solvable instances, it loses the ability to detect unsolvability—unsolvable detection accuracy collapses to near zero (Peng et al., 1 Dec 2025). This is attributed to gradient interference at the refusal head; if the features for solvable and unsolvable instances are correlated, negative updates on solvable data suppress refusal on all inputs. The UnsolvableRL protocol prevents collapse by:
- ensuring every RL batch contains both solvable and unsolvable data,
- employing a false-rejection penalty (a negative reward for tagging solvable instances as unsolvable) to avoid universal refusal,
- dynamically tuning the refusal threshold to incentivize prudent, data-driven calibration.
Ablation experiments confirm that these components are indispensable; even a fixed refusal threshold leads, in the limit, to either zero or universal refusal.
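The first of these safeguards, mixed batching, can be sketched directly. The 50/50 mix and pool structure below are illustrative assumptions; the paper specifies only that every batch contains both data types:

```python
# Sketch of the mixed-batch protocol that prevents capability collapse:
# every RL batch draws from both the solvable and the unsolvable pool,
# so gradient updates never come exclusively from solvable data.
# The 50/50 split is an assumed ratio.
import random

def mixed_batch(solvable_pool, unsolvable_pool, batch_size, rng):
    k = batch_size // 2
    batch = (rng.sample(solvable_pool, k)
             + rng.sample(unsolvable_pool, batch_size - k))
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
b = mixed_batch(list(range(100)), list(range(100, 200)), 8, rng)
print(sum(x < 100 for x in b), sum(x >= 100 for x in b))  # 4 4
```

If the unsolvable pool were ever empty, the refusal behavior would receive only negative gradient signal from solvable data, which is precisely the collapse mechanism described above.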
6. Key Empirical Results
In Qwen3-4B experiments:
- Performance is measured by a combined score: the mean of solvable accuracy and the rejection rate on unsolvable instances.
- UnsolvableRL achieves a substantially higher combined score than the baseline instruct model.
- The rejection rate on unsolvable instances rises sharply, and solvable accuracy also improves in some domains (e.g., Game24).
Ablating to "Solvable-Only" training collapses unsolvability detection to near zero. Fixed-threshold ablations likewise show worse trade-offs than the dynamic schedule. The approach establishes that (i) negative-data exposure, (ii) a false-rejection penalty, and (iii) a dynamic refusal threshold are all necessary for robust boundary-of-solvability alignment.
7. Limitations and Directions for Future Work
Primary limitations of the UnsolvableQA framework and its associated RL protocol include:
- Dependence on High-Quality Contradictory Data: Constructing paired unsolvable examples in open-ended domains remains labor-intensive (Peng et al., 1 Dec 2025).
- Sensitivity to Threshold Scheduling: The dynamic threshold schedule and its calibration scaling parameters require careful selection.
- Limited Generalization Evidence: Extensions are needed for out-of-distribution (OOD) unsolvable benchmarks, further feature-space diagnostics, application to alternative policy optimization methods (e.g., PPO, DPO), and to multi-agent/self-play regimes.
Potential enhancements involve incorporating human feedback on refusal calibration, uncertainty-aware token-level rewards, and broader stress-testing for generalization beyond current synthetic datasets.
In sum, the UnsolvableQA paradigm operationalizes and evaluates LLMs’ ability to distinguish between solvability, unsolvability, and incapability, using a principled reinforcement learning alignment framework and a synthetic paired dataset (Peng et al., 1 Dec 2025). This approach provides a rigorous, scalable testbed for research on reliable model refusal and rejection calibration in high-stakes reasoning domains.