Refusal Index: A Metric for LLM Refusal Behavior
- Refusal Index (RI) is a quantitative metric that measures an LLM's refusal behavior by correlating refusal probabilities with error probabilities in factual tasks and by computing full-refusal rates for safety guardrails.
- It employs distinct experimental protocols, including a two-pass method for knowledge-aware cases and hand-curated prompt templates for group-conditional safety evaluations, to generate actionable insights on model biases.
- Empirical results demonstrate significant differences in refusal rates across demographic groups and show that RI provides a stable benchmark that is largely independent of traditional accuracy and calibration metrics.
The Refusal Index (RI) is a quantitative metric formalized in recent LLM research to measure, in distinct contexts, either the knowledge-awareness of model refusals or the prevalence of safety guardrail refusals as a function of input attributes. RI provides a principled basis for evaluating refusal behavior, distinct from traditional accuracy, error, or calibrated confidence scores, and serves as a direct probe of LLM safety and factuality mechanisms (Pan et al., 2 Oct 2025, Khorramrouz et al., 31 Oct 2025).
1. Definitions of the Refusal Index (RI)
RI has been independently defined with technically precise formulations in different lines of research:
- Knowledge-Aware Refusal (Factual Tasks): Here, the RI is the Spearman’s rank correlation ($\rho$) between the model’s refusal probability on a question and its error probability on the same question. Formally, for an LLM $M$:

$$\mathrm{RI}(M) = \rho\big(p_{\mathrm{refuse}}(q),\ p_{\mathrm{error}}(q)\big)$$

where $p_{\mathrm{refuse}}(q)$ is the model’s refusal probability on question $q$ and $p_{\mathrm{error}}(q)$ is the corresponding error probability. $\mathrm{RI} = 1$ signifies perfect alignment: the model refuses most on questions it would get wrong (Pan et al., 2 Oct 2025).
- Guardrail Safety (Group-Conditional): RI for a demographic group $g$ is simply the full-refusal rate:

$$\mathrm{RI}(g) = \frac{N_{\mathrm{refuse}}(g)}{N_{\mathrm{total}}(g)}$$

where $N_{\mathrm{refuse}}(g)$ is the number of full refusals given prompts about group $g$ from a standardized template set, and $N_{\mathrm{total}}(g)$ is the total number of prompts for $g$ (Khorramrouz et al., 31 Oct 2025). This RI quantifies how often the model invokes safety refusals for a given group.
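Under these definitions, both RI variants reduce to a few lines of code. The sketch below (illustrative function and variable names, not from either paper; `scipy` assumed available) computes the knowledge-aware RI as a Spearman correlation over per-question probabilities and the group-conditional RI as a full-refusal fraction:

```python
import numpy as np
from scipy.stats import spearmanr

def knowledge_aware_ri(p_refuse, p_error):
    """RI(M): Spearman's rho between per-question refusal and error probabilities."""
    rho, _ = spearmanr(p_refuse, p_error)
    return rho

def group_conditional_ri(labels):
    """RI(g): fraction of full refusals among all prompts for group g.
    `labels` holds one of {"compliance", "partial_refusal", "refusal"} per response."""
    labels = np.asarray(labels)
    return float(np.mean(labels == "refusal"))

# Toy illustration with made-up probabilities and labels:
p_refuse = [0.9, 0.1, 0.7, 0.2]   # model refuses most where it would err most
p_error  = [0.8, 0.1, 0.6, 0.2]   # same rank order -> perfect alignment
print(knowledge_aware_ri(p_refuse, p_error))   # 1.0
print(group_conditional_ri(["refusal", "compliance",
                            "refusal", "partial_refusal"]))  # 0.5
```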
2. Measurement Methodologies
Each RI formulation requires distinct experimental pipelines for faithful and unbiased estimation.
A. Knowledge-Aware RI (Factual QA):
- Adopt a black-box two-pass protocol:
- First Pass: The model may freely refuse; collect refusal and correctness on factual questions.
- Second Pass: Force the model to answer previously refused questions; record accuracy.
- Latent Estimation: Employ a Gaussian copula model, mapping observed binary refusals and errors to latent ranks. Maximum likelihood estimation recovers the latent correlation ($r$), which is then converted to Spearman’s $\rho$.
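A minimal sketch of this latent-estimation step, under the stated Gaussian copula assumption: binary refusals and errors are treated as thresholded bivariate-normal latents, the latent correlation $r$ is recovered by maximum likelihood over the 2×2 contingency table (the classical tetrachoric-correlation estimator), and $r$ is mapped to Spearman’s $\rho$ via the bivariate-normal identity $\rho = \tfrac{6}{\pi}\arcsin(r/2)$. All names here are illustrative, not from the paper:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def latent_spearman(refused, erred):
    """Estimate Spearman's rho between latent refusal and error tendencies
    from binary observations, assuming a Gaussian copula (nondegenerate rates)."""
    refused, erred = np.asarray(refused), np.asarray(erred)
    # Thresholds implied by the marginal rates: P(latent > tau) = observed rate.
    tau_r = norm.ppf(1 - refused.mean())
    tau_e = norm.ppf(1 - erred.mean())
    # 2x2 contingency counts (refused x erred).
    n00 = np.sum((refused == 0) & (erred == 0))
    n01 = np.sum((refused == 0) & (erred == 1))
    n10 = np.sum((refused == 1) & (erred == 0))
    n11 = np.sum((refused == 1) & (erred == 1))

    def neg_loglik(r):
        cov = [[1.0, r], [r, 1.0]]
        p00 = multivariate_normal.cdf([tau_r, tau_e], mean=[0.0, 0.0], cov=cov)
        p01 = norm.cdf(tau_r) - p00          # no refusal, error
        p10 = norm.cdf(tau_e) - p00          # refusal, no error
        p11 = 1.0 - p00 - p01 - p10          # refusal and error
        probs = np.clip([p00, p01, p10, p11], 1e-12, 1.0)
        return -(np.array([n00, n01, n10, n11]) * np.log(probs)).sum()

    r_hat = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded").x
    return (6.0 / np.pi) * np.arcsin(r_hat / 2.0)

# Sanity check on synthetic data with known latent correlation r = 0.6:
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=20000)
rho_hat = latent_spearman((z[:, 0] > 0).astype(int), (z[:, 1] > 0).astype(int))
print(rho_hat)  # close to (6/pi) * arcsin(0.3) ~= 0.582
```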
B. Group-Conditional RI (Safety/Compliance):
- Construct prompts from hand-curated templates with target demographic group placeholders.
- For each group $g$:
- Run all prompt instances for the group.
- Automated filters and LLM-based classifiers categorize responses as Compliance, Partial Refusal, or Refusal.
- Compute $\mathrm{RI}(g)$ directly as the full-refusal fraction.
Experimental Controls: Both protocols include statistical validation (e.g., chi-square contingency for group uniformity, Friedman's test for refusal length), human annotation verification, and cross-model robustness checks (Pan et al., 2 Oct 2025, Khorramrouz et al., 31 Oct 2025).
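The chi-square uniformity check reduces to a contingency test of full-refusal vs. non-refusal counts across groups. A sketch with invented counts (group names and numbers are hypothetical; `scipy` assumed):

```python
from scipy.stats import chi2_contingency

# Hypothetical per-group (full-refusal, non-refusal) counts from a template set.
counts = {
    "group_a": (140, 60),   # RI = 0.70
    "group_b": (64, 136),   # RI = 0.32
    "group_c": (70, 130),   # RI = 0.35
}
table = [list(c) for c in counts.values()]
chi2, p_value, dof, _ = chi2_contingency(table)
for g, (r, nr) in counts.items():
    print(f"{g}: RI = {r / (r + nr):.2f}")
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
# A small p-value rejects uniform refusal rates across groups,
# i.e., evidence of selective refusal bias.
```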
3. Use Cases and Key Empirical Results
A. Selective Refusal Bias (Safety Context):
- Large-scale LLM guardrails refuse at higher rates on prompts targeting marginalized groups:
- Gender: Trans men/women (RI up to $0.72$) versus men/women (up to $0.52$).
- Religion: Jewish and Muslim prompts are refused at markedly higher rates than Taoist prompts.
- Nationality: Mexican prompts are refused more often than American/French/Canadian prompts (up to $0.35$).
- Intersectional groups (e.g., “Mexican Trans men”) exhibit RIs interpolating toward the more-refused attribute.
- Refusal length varies: refusals for majority groups are up to $30$ tokens longer than for others (significant by Friedman’s test).
- Indirect attacks bypass guardrails: 89.5% of previous refusals could be circumvented via prompt adaptation (Khorramrouz et al., 31 Oct 2025).
B. Knowledge-Aware Refusal (Factual QA):
- RI demonstrates low variance across refusal-rate manipulations, with a coefficient of variation well below that of baseline metrics.
- RI rankings are largely independent of overall accuracy, showing only a weak correlation with correct answer rate.
- Prompt-induced changes in refusal rates or aggressive cautiousness do not improve RI: models rarely align refusals with actual knowledge gaps (Pan et al., 2 Oct 2025).
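The stability claim can be made concrete: because Spearman’s $\rho$ depends only on ranks, any monotone re-scaling of a model’s refusal probabilities (e.g., a prompt that makes it uniformly more cautious) leaves RI unchanged, while threshold-style metrics such as correctness-on-attempted move with the refusal rate. A synthetic sketch (all data invented for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
p_error = rng.uniform(0, 1, 500)                     # per-question error probability
p_refuse = 0.6 * p_error + rng.uniform(0, 0.4, 500)  # partially knowledge-aware refusal

ri_before, _ = spearmanr(p_refuse, p_error)
# "Be cautious" prompt modeled as a monotone boost of refusal probabilities:
ri_after, _ = spearmanr(p_refuse ** 0.25, p_error)
print(ri_before, ri_after)  # identical: RI is rank-based

def correctness_on_attempted(threshold):
    """C/A under a refusal threshold: answer only when refusal probability is low."""
    attempted = p_refuse < threshold
    return np.mean(p_error[attempted] < 0.5)  # fraction "correct" among attempts

print(correctness_on_attempted(0.9), correctness_on_attempted(0.3))
# C/A rises as the model refuses more, even though RI (knowledge-awareness) is fixed.
```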
| Use Case | RI Formula | Main Empirical Findings |
|---|---|---|
| Group-conditional safety (demographic) | $\mathrm{RI}(g) = N_{\mathrm{refuse}}(g) / N_{\mathrm{total}}(g)$ | Consistent selective bias; marginalized groups receive higher RI |
| Knowledge-aware refusal (factual QA) | $\mathrm{RI}(M) = \rho(p_{\mathrm{refuse}}, p_{\mathrm{error}})$ | RI shows low dependence on accuracy and stable rankings |
4. Comparison to Other Refusal and Calibration Metrics
Traditional refusal-based metrics—such as correct answer rate, correctness given attempted, F-score, and refusal rate—are confounded by underlying refusal tendencies. High refusal rates inflate correctness-on-attempted, while low refusal rates inflate the overall correct answer rate, leading to inconsistent or misleading model comparisons.
Calibration metrics (e.g., ECE, Brier score) address output confidence but not the intrinsic refusal mechanism. The RI, in its knowledge-aware form, is independent of explicit output probability calibration, does not require auxiliary calibrators, and directly quantifies the rank relationship between knowledge uncertainty and refusal tendencies.
Stability analyses demonstrate that RI outperforms baseline metrics in ranking models robustly across prompt settings, refusal manipulations, and factual consistency tasks. For instance, after monotonic effects are regressed out, the RI retains high Kendall’s $\tau$ ($\approx 0.49$) and low Winner Entropy, while baselines approach random assignment (Pan et al., 2 Oct 2025).
5. Statistical Analyses and Interpretive Implications
Statistical contingency analysis (e.g., chi-square) consistently rejects uniformity of RI across demographic attributes, confirming the presence of selective refusal bias. All groupwise RI differences in intersectional settings are highly significant.
RI is stable with respect to both absolute accuracy and refusal rates, indicating its utility as an orthogonal diagnostic. Prompt-based increases in refusal may yield higher “correctness-on-attempted” (C/A), but without a corresponding rise in RI, implying that model caution does not automatically increase genuine knowledge-awareness.
A plausible implication is that guardrail policies focused solely on demographic group keywords or input pattern-matching, rather than semantic assessment of toxicity or error, are likely to perpetuate representational harms or create bypass vulnerabilities (Khorramrouz et al., 31 Oct 2025).
6. Limitations and Recommendations
- RI estimation in knowledge-aware settings requires a two-pass protocol, moderate dataset sizes (3–5K examples to bring the coefficient of variation to roughly 0.1), and model adherence to structured refusal prompts.
- RI in group-conditional settings is sensitive to the distribution and construction of prompt templates as well as the design of the refusal classifier.
- Finally, RI targets only specific refusal modes—knowledge-aware or safety/guardrail-driven—as evaluated. It does not directly address open-ended, multimodal, or unprompted refusal behavior.
Recommendations motivated by empirical findings include:
- Guardrail classifiers should be audited and optimized for “refusal parity”—equal RI across demographic attributes.
- Refusal detection should evaluate response semantic toxicity rather than rely on presence of demographic keywords.
- Ongoing monitoring with both individual and intersectional group probes is required, with transparent release of guardrail datasets and filter policies to enable external review (Khorramrouz et al., 31 Oct 2025, Pan et al., 2 Oct 2025).
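A “refusal parity” audit, as recommended above, can be operationalized as the maximum pairwise RI gap across groups, checked against a tolerance. A hypothetical sketch (the group names, RI values, and tolerance are all invented for illustration):

```python
def refusal_parity_gap(group_ri):
    """Maximum pairwise gap in RI across demographic groups; 0.0 means perfect parity."""
    values = list(group_ri.values())
    return max(values) - min(values)

# Made-up per-group RIs from a guardrail audit run.
group_ri = {"group_a": 0.70, "group_b": 0.32, "group_c": 0.35}
gap = refusal_parity_gap(group_ri)
TOLERANCE = 0.05  # illustrative audit threshold, not from the paper
print(f"parity gap = {gap:.2f}; audit {'passes' if gap <= TOLERANCE else 'fails'}")
```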
7. Future Research Directions
Areas for continued investigation involve developing richer prompts to probe deeper model self-knowledge, extending RI to open-ended and multimodal tasks, coupling RI with calibration and factuality diagnostics for multi-dimensional reliability assessments, and exploring the impact of pretraining pipeline and supervision signals on emergent refusal behavior (Pan et al., 2 Oct 2025).
Continuous external auditing and more inclusive, context-sensitive guardrail frameworks are recommended to both advance equitable AI safety and improve alignment of model refusals with actual knowledge boundaries.