
UnsolvableQA Paradigm for LLM Calibration

Updated 28 January 2026
  • UnsolvableQA is a framework that trains LLMs to distinguish between solvable, unsolvable, and beyond-capacity problems using paired benchmarks.
  • The framework leverages programmatic generation and reverse construction to create datasets with explicit solvability labels for rigorous evaluation.
  • Reinforcement learning with dynamic refusal thresholds and group-relative policy optimization is used to improve accuracy and prevent capability collapse.

The UnsolvableQA paradigm refers to a class of benchmarks and learning objectives designed to train and evaluate LLMs on their ability to solve problems, detect objectively unsolvable instances (i.e., those with internal contradictions), and calibrate their refusal behavior on instances that are solvable in principle but beyond their current capability. The paradigm is operationalized via the UnsolvableQA dataset and the UnsolvableRL framework, which jointly drive LLM alignment on solvability detection and prudent refusal (Peng et al., 1 Dec 2025).

1. Motivation and Formalization

Traditional LLM evaluation focuses on instance accuracy, but LLMs frequently hallucinate confident answers to unsolvable or contradictory tasks, leading to reliability failures. The UnsolvableQA paradigm addresses this limitation by explicitly pairing solvable and unsolvable problems and constructing multi-faceted objectives that encourage correct problem-solving, high-precision rejection on inherently unsolvable instances, and careful refusal on exceptionally difficult but technically solvable cases.

Formally, given an input $x$ drawn from a dataset $\mathcal{D}$, each instance falls into one of three categories:

  • $x \in \mathcal{S}$: objectively solvable, with a ground-truth answer set;
  • $x \in \mathcal{U}$: unsolvable, containing inherent contradictions;
  • $x \in \mathcal{C}$ (implied): solvable in principle but outside the model's current capacity.

The model’s policy $\pi_\theta$ must produce:

  • a correct solution when $x \in \mathcal{S}$;
  • a canonical $\langle\text{unsolvable}\rangle$ tag when $x \in \mathcal{U}$;
  • a $\langle\text{beyond\_capacity}\rangle$ tag when $x \in \mathcal{C}$, i.e., when the instance is solvable in principle but the model's empirical accuracy on its cohort is low.
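
The three-way target behavior can be sketched as a small dispatch function (a minimal illustration; the category labels, tag strings, and function names here are assumptions, not the paper's API):

```python
# Hypothetical labels for the three instance categories described above.
SOLVABLE, UNSOLVABLE, BEYOND_CAPACITY = "S", "U", "C"

def target_output(category, solve_fn, x):
    """Return the response an ideally calibrated policy should produce."""
    if category == SOLVABLE:
        return solve_fn(x)            # a correct solution
    if category == UNSOLVABLE:
        return "<unsolvable>"         # canonical refusal tag
    return "<beyond_capacity>"        # solvable in principle, beyond capacity
```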

2. Construction of UnsolvableQA Data

UnsolvableQA consists of paired solvable and unsolvable instances across diverse domains such as Game24, Hamiltonian paths, Hitori, Mazes, and AIME-style mathematics problems (Peng et al., 1 Dec 2025). Its construction involves two main methodologies:

  • Programmatic Generation: For logic puzzles, solvable/unsolvable instances are created via constraint-based enumeration and contradiction injection.
  • Reverse Construction: For mathematical domains, valid reasoning chains are perturbed to deliberately introduce contradictions, ensuring objective unsolvability.
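
As an illustration of the programmatic-generation route, a brute-force solvability labeler for Game24-style instances can enumerate every way to combine the numbers with the four arithmetic operators (a sketch of the general idea, not the paper's generator):

```python
from itertools import permutations

def game24_solvable(nums, target=24, eps=1e-6):
    """Label an instance as solvable/unsolvable by exhaustive enumeration."""
    def combine(a, b):
        # All results obtainable from a and b with +, -, *, / (both orders).
        yield a + b
        yield a - b
        yield b - a
        yield a * b
        if abs(b) > eps:
            yield a / b
        if abs(a) > eps:
            yield b / a

    def solve(vals):
        if len(vals) == 1:
            return abs(vals[0] - target) < eps
        for i, j in permutations(range(len(vals)), 2):
            if i < j:
                rest = [v for k, v in enumerate(vals) if k not in (i, j)]
                if any(solve(rest + [r]) for r in combine(vals[i], vals[j])):
                    return True
        return False

    return solve([float(n) for n in nums])
```

A contradiction-free unsolvable instance is then simply one for which the enumeration returns `False`, e.g. `[1, 1, 1, 1]`, while `[4, 6, 1, 1]` is solvable via $4 \times 6 \times 1 \times 1$.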

The dataset for Qwen3 experiments includes 637 training instances (348 solvable, 289 unsolvable) and 699 test instances, carefully balanced across domains to prevent bias and to distinguish between unsolvability and incapability.

3. UnsolvableRL Framework

The UnsolvableRL framework aligns LLMs for the UnsolvableQA task by optimizing reinforcement learning objectives that integrate accuracy, unsolvability detection, and capability-calibrated refusal (Peng et al., 1 Dec 2025). The joint per-trajectory reward is defined as:

$R(y|x) = R_{\rm acc}(y|x) + R_{\rm detect}(y|x) + R_{\rm cal}(y|x)$

where:

  • $R_{\rm acc}(y|x)$ rewards correct answers ($+1$) on $x \in \mathcal{S}$;
  • $R_{\rm detect}(y|x)$ rewards correct $\langle\text{unsolvable}\rangle$ tags ($+1$) on $x \in \mathcal{U}$ and imposes a penalty $\rho = -0.5$ for false rejections on $x \in \mathcal{S}$;
  • $R_{\rm cal}(y|x)$ encourages $\langle\text{beyond\_capacity}\rangle$ only when the model's empirical batch accuracy $\beta = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[y_i \text{ correct}]$ falls below a dynamic threshold $\tau$:

$R_{\rm cal}(y) = \lambda(\tau - \beta)\,\mathbf{1}[y = \langle\text{beyond\_capacity}\rangle]$

with $\lambda > 0$ and $\tau$ annealed upward toward $1$ during training.
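
The composite reward can be sketched as a small scoring function (the response sentinels and $\lambda = 1$ default are illustrative assumptions; only $\rho = -0.5$ is stated in the text):

```python
def joint_reward(response, category, batch_acc, tau, lam=1.0, rho=-0.5):
    """Per-trajectory reward R = R_acc + R_detect + R_cal (a sketch).

    `response` is 'correct', 'wrong', '<unsolvable>' or '<beyond_capacity>';
    `batch_acc` is the empirical accuracy beta on the instance's cohort.
    """
    r = 0.0
    if category == "solvable":
        if response == "correct":
            r += 1.0                      # R_acc: +1 for a correct answer
        elif response == "<unsolvable>":
            r += rho                      # penalty for a false rejection
    elif category == "unsolvable" and response == "<unsolvable>":
        r += 1.0                          # R_detect: +1 for a correct refusal
    if response == "<beyond_capacity>":
        r += lam * (tau - batch_acc)      # R_cal: positive only when beta < tau
    return r
```

Note the sign structure: claiming $\langle\text{beyond\_capacity}\rangle$ is rewarded when the cohort accuracy is below $\tau$ and penalized otherwise, which is what drives calibrated rather than blanket refusal.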

4. Learning Algorithm: Group-Relative Policy Optimization

UnsolvableRL employs Group-Relative Policy Optimization (GRPO) (Peng et al., 1 Dec 2025), which uses within-group normalization of rewards to handle heterogeneous objectives and stabilize training across diverse instance types. For group size $G$, the approach samples $G$ outputs per prompt, computes trajectory rewards, and normalizes advantages:

$A_i = \frac{R_i - \mu_R}{\sigma_R + \epsilon}$

with

$\mu_R = \frac{1}{G}\sum_j R_j, \quad \sigma_R^2 = \frac{1}{G}\sum_j (R_j - \mu_R)^2$

The surrogate objective takes the PPO-style clipped form:

$\mathcal{J}(\theta) = \mathbb{E}_{x,\{y_i\}}\left[\frac{1}{G}\sum_{i=1}^G \min\big(r_i(\theta) A_i,\ \mathrm{clip}(r_i(\theta), 1-\varepsilon, 1+\varepsilon)\,A_i\big)\right]$

This is optimized by stochastic gradient ascent without a separate value network, facilitating direct handling of the composite reward.
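
The group normalization and one summand of the clipped surrogate are straightforward to sketch (population statistics over the group, matching the formulas above; function names are my own):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Within-group normalization: A_i = (R_i - mu_R) / (sigma_R + eps)."""
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio, adv, eps_clip=0.2):
    """One summand of the PPO-style clipped objective: min(rA, clip(r)A)."""
    clipped_ratio = max(1 - eps_clip, min(ratio, 1 + eps_clip))
    return min(ratio * adv, clipped_ratio * adv)
```

Because advantages are centered within each group, a prompt whose sampled trajectories all receive the same reward contributes no gradient, which is one reason mixed batches (Section 5) matter.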

5. Capability Collapse and Negative Supervision

A critical empirical finding is the capability collapse phenomenon: if an LLM is trained on only solvable instances, it loses the ability to detect unsolvability—unsolvable detection accuracy collapses to near zero (Peng et al., 1 Dec 2025). This is attributed to gradient interference at the refusal head; if the features for solvable and unsolvable instances are correlated, negative updates on solvable data suppress refusal on all inputs. The UnsolvableRL protocol prevents collapse by:

  • ensuring every RL batch contains both solvable and unsolvable data,
  • employing a negative false-rejection penalty ($\rho < 0$) to avoid universal refusal,
  • dynamically tuning the refusal threshold $\tau$ to incentivize prudent, data-driven calibration.
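
The first of these, batch mixing, can be sketched as follows (the 40% unsolvable fraction and function names are my assumptions, not the paper's protocol):

```python
import random

def mixed_batch(solvable_pool, unsolvable_pool, batch_size, frac_unsolvable=0.4):
    """Sample an RL batch that always contains both instance types,
    so refusal features keep receiving gradient signal of both signs."""
    k = max(1, min(batch_size - 1, int(batch_size * frac_unsolvable)))
    batch = (random.sample(unsolvable_pool, k)
             + random.sample(solvable_pool, batch_size - k))
    random.shuffle(batch)
    return batch
```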

Ablation experiments confirm the indispensability of these components; even the use of a fixed $\tau$ leads to either zero or universal refusal in the limit.

6. Key Empirical Results

In Qwen3-4B experiments:

  • The baseline instruct model achieves a combined score (mean of solvable accuracy $S$ and unsolvability-rejection rate $U$) of $\approx 49.2\%$.
  • UnsolvableRL achieves a combined score of $\approx 88.3\%$ ($S \approx 69.5\%$, $U \approx 90.9\%$).
  • Unsolvable-instance rejection rises from $36.3\%$ to $90.9\%$; solvable accuracy in some domains (e.g., Game24) increases from $49.0\%$ to $95.5\%$.

Ablation on "Solvable-Only" training collapses U-detection to $\leq 1.5\%$. Fixed-$\tau$ ablations confirm worse trade-offs compared to a dynamic schedule. The approach establishes that (i) negative-data exposure, (ii) a negative false-rejection penalty, and (iii) a dynamic refusal threshold are all necessary for robust boundary-of-solvability alignment.

7. Limitations and Directions for Future Work

Primary limitations of the UnsolvableQA framework and its associated RL protocol include:

  • Dependence on High-Quality Contradictory Data: Constructing paired unsolvable examples in open-ended domains remains labor-intensive (Peng et al., 1 Dec 2025).
  • Sensitivity to Threshold Scheduling: The dynamic $\tau$ schedule and calibration scaling parameters require careful selection.
  • Limited Generalization Evidence: Extensions are needed for out-of-distribution (OOD) unsolvable benchmarks, further feature-space diagnostics, application to alternative policy optimization methods (e.g., PPO, DPO), and to multi-agent/self-play regimes.

Potential enhancements involve incorporating human feedback on refusal calibration, uncertainty-aware token-level rewards, and broader stress-testing for generalization beyond current synthetic datasets.


In sum, the UnsolvableQA paradigm operationalizes and evaluates LLMs’ ability to distinguish between solvability, unsolvability, and incapability, using a principled reinforcement learning alignment framework and a synthetic paired dataset (Peng et al., 1 Dec 2025). This approach provides a rigorous, scalable testbed for research on reliable model refusal and rejection calibration in high-stakes reasoning domains.
