
StrongREJECT Scores: Robust Reject Metrics

Updated 5 February 2026
  • StrongREJECT Scores are rigorously quantified metrics that formalize posterior probabilities to systematically reject uncertain, unsafe, or harmful outputs.
  • They leverage probabilistic models and coupled rejection metrics to enhance credit scoring AUC, selective classification performance, and adversarial detection.
  • Empirical evaluations demonstrate improvements of up to 10–20% in risk-coverage trade-offs and strong alignment with human judgments in LLM autograding.

StrongREJECT Scores denote a class of rigorously quantified metrics and algorithms for decision abstention or robustness evaluation, appearing across recent literature in credit scoring, adversarial robustness, and LLM safety. At their core, StrongREJECT Scores formalize the posterior likelihood, uncertainty, or harmfulness of a decision or response, enabling systematic discrimination between uncertain, unsafe, or harmful outputs and those suitable for acceptance or deployment. The concept encompasses probabilistic scores for reject inference in credit scoring (Mancisidor et al., 2019), coupled rejection metrics for adversarial example detection (Pang et al., 2021), and fine-grained, human-aligned autograder scores for evaluating prompt jailbreak effectiveness in LLMs (Souly et al., 2024).

1. Probabilistic Foundations in Reject Inference

In credit scoring, reject inference addresses selection bias arising from models trained solely on accepted applications. Mancisidor et al. (Mancisidor et al., 2019) formalize StrongREJECT Scores as the posterior default probability for rejected loan applicants using deep generative, semi-supervised Bayesian models. Let $x \in \mathbb{R}^{\ell_x}$ denote applicant features, $y \in \{0,1\}$ the default indicator, and $z \in \mathbb{R}^{\ell_z}$ a latent variable. The generative process factorizes as

$$p_\theta(x, y, z) = p(y)\, p_\theta(z \mid y)\, p_\theta(x \mid z, (y))$$

where $p(y=k) = \mathrm{Bernoulli}(k; \pi)$, $p_\theta(z \mid y=k) = \mathcal{N}(z; \mu_{z,k}(\theta), \Sigma_{z,k}(\theta))$, and $p_\theta(x \mid z, (y))$ is Gaussian, conditioned either on $z$ alone (Model 1) or on $(z, y)$ (Model 2). For rejected applications, inference is conducted via exact enumeration over $y$ and amortized variational distributions $q_\phi(z \mid x, y)$ and $q_\phi(y \mid x)$. The StrongREJECT Score is thus

$$q_\phi(y = 1 \mid x^*) \approx P(y = 1 \mid x^*; \theta)$$

which gives a calibrated default probability for any rejected applicant. This construction yields consistently superior Area-Under-Curve (AUC) performance on large real portfolios, outperforming classical and contemporary semi-supervised reject-inference baselines (Mancisidor et al., 2019).
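The posterior computation reduces to Bayes' rule over the class-conditional latent densities. The following is a minimal numerical sketch, not the paper's amortized neural inference: the isotropic Gaussians and their parameters are hypothetical stand-ins for the learned $p_\theta(z \mid y)$.

```python
import numpy as np

def reject_score(z, mu, sigma, pi):
    """Posterior P(y=1 | z) for a latent code z under class-conditional
    isotropic Gaussians p(z | y=k) = N(mu[k], sigma[k]^2 I) and prior
    p(y=1) = pi -- a simplified stand-in for the amortized q_phi(y | x)."""
    def log_gauss(v, m, s):
        # isotropic Gaussian log-density up to an additive constant
        # (constants cancel in the posterior ratio)
        return -0.5 * np.sum(((v - m) / s) ** 2) - v.size * np.log(s)
    log_p1 = log_gauss(z, mu[1], sigma[1]) + np.log(pi)
    log_p0 = log_gauss(z, mu[0], sigma[0]) + np.log(1.0 - pi)
    # numerically stable two-way softmax
    top = max(log_p0, log_p1)
    return float(np.exp(log_p1 - top) /
                 (np.exp(log_p0 - top) + np.exp(log_p1 - top)))
```

A latent code near the defaulter cluster mean then receives a score near 1, and one near the non-defaulter mean a score near 0, mirroring how the amortized posterior separates rejected applicants.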

2. Optimal Reject Option Strategies and Uncertainty Scoring

In the context of selective classification, StrongREJECT Scores generalize to optimal uncertainty and abstention scores. Here, the framework considers a selective classifier $(h, c)$, where $h$ predicts a label and $c(x) \in [0,1]$ determines the acceptance probability (Franc et al., 2021). The optimal strategy is universally a Bayes classifier $h_B$ paired with a randomized Bayes selection function:

$$c^*(x) = S(r(x); \alpha, \nu) = \begin{cases} 1, & r(x) < \alpha \\ \nu, & r(x) = \alpha \\ 0, & r(x) > \alpha \end{cases}$$

with $r(x) = \mathbb{E}[\ell(y, h(x)) \mid x]$ the conditional risk under the loss $\ell$. The selection threshold $\alpha$ and the randomization probability $\nu$ are determined by the risk-coverage or cost-coverage constraints. Any monotone function $s(x)$ that preserves the ordering induced by $r(x)$ is a "proper uncertainty score," and Fisher-consistent learning algorithms, such as ridge loss-regression or the SELE surrogate, can be used to fit $s(x)$ from data. These methods guarantee that the induced selective classifier converges to the risk-optimal acceptance/rejection set (Franc et al., 2021).
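The selection rule above translates directly into code. The sketch below assumes the conditional risks $r(x)$ have already been estimated; `fit_threshold` is one simple way to calibrate $(\alpha, \nu)$ to a target empirical coverage, not the paper's fitting procedure.

```python
import numpy as np

def selector(r, alpha, nu):
    """Randomized Bayes selection function S(r; alpha, nu): accept (1)
    below the threshold, randomize (nu) exactly at it, reject (0) above."""
    if r < alpha:
        return 1.0
    if r == alpha:
        return nu
    return 0.0

def fit_threshold(risks, target_coverage):
    """Pick (alpha, nu) so expected empirical coverage hits the target --
    an illustrative calibration on the sorted risk sample."""
    risks = np.sort(np.asarray(risks, dtype=float))
    n = len(risks)
    k = target_coverage * n              # expected number of accepted samples
    k_full = int(np.floor(k))
    if k_full >= n:                      # accept everything
        return float(risks[-1]), 1.0
    alpha = risks[k_full]                # marginal sample sits at the threshold
    below = np.sum(risks < alpha)        # accepted deterministically
    at = np.sum(risks == alpha)          # tied samples, accepted with prob. nu
    nu = (k - below) / at
    return float(alpha), float(nu)
```

The randomization $\nu$ only matters when probability mass sits exactly at the threshold, which is why deterministic thresholding suffices for continuous risk distributions.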

3. Coupled Rejection Metrics for Adversarial Robustness

Robustness to adversarial examples in deep learning benefits from jointly leveraging multiple rejection scores. In "Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart" (Pang et al., 2021), StrongREJECT arises as a two-stage coupling of the standard confidence with a rectified confidence (R-Con). For a classifier $f_\theta(x) \in \Delta^L$ predicting class $y^m$, the confidence is $f_\theta(x)[y^m]$. The rectified confidence score is

$$\mathrm{R\text{-}Con}(x) = f_\theta(x)[y^m] \cdot A_\phi(x)$$

where $A_\phi$ is a learned auxiliary MLP trained to approximate the true softmax probability of the ground-truth class. The RR (rectified rejection) module's two-stage decision rule (thresholding confidence, then R-Con at $1/2$) offers a provable separability guarantee for distinguishing correct from incorrect predictions, contingent on the rectifier's bounded approximation error ($\xi$-error). Empirical evaluation across adversarially trained models and datasets (CIFAR-10, CIFAR-100, CIFAR-10-C) demonstrates that integrating RR delivers consistent increases in robust accuracy and rejected-set AUC with minimal computational overhead. This coupling strategy yields substantial practical robustness improvements under both standard and adaptive adversarial attacks (Pang et al., 2021).
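The two-stage decision rule can be sketched as follows. Only the $1/2$ threshold on R-Con is taken from the paper; the confidence threshold and the rectifier output are illustrative placeholders for a trained $A_\phi$.

```python
def rr_decision(confidence, rectifier_out, conf_threshold, rcon_threshold=0.5):
    """Two-stage rectified-rejection (RR) rule, as a sketch:
    stage 1 gates on plain softmax confidence, stage 2 on
    R-Con = confidence * A_phi(x), thresholded at 1/2."""
    r_con = confidence * rectifier_out       # rectified confidence
    if confidence < conf_threshold:
        return "reject"                      # stage 1: raw confidence too low
    if r_con < rcon_threshold:
        return "reject"                      # stage 2: rectified confidence too low
    return "accept"
```

The coupling matters because adversarial examples often carry high raw confidence but a low rectifier output, so only inputs passing both gates are accepted.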

4. StrongREJECT Benchmark and Scoring for Jailbreak Evaluation

Addressing methodological deficiencies in LLM jailbreak efficacy evaluation, the StrongREJECT benchmark (Souly et al., 2024) establishes a high-quality taxonomy of forbidden prompts and an LLM-based autograder for measuring response harmfulness. The dataset comprises 346 questions across six misuse categories (illegal goods, non-violent crimes, hate/harassment, disinformation, violence, illicit sexual content), generated and curated to maximize coverage and eliminate ambiguous cases.

Each model response is scored automatically via a GPT-4 autograder along three axes—refusal (binary), specificity, and convincingness (each 1–5, rescaled to [0,1]). The StrongREJECT Score is defined as

$$S_{\text{SR}} = (1 - \mathrm{refused}) \times \frac{\mathrm{specific}' + \mathrm{convincing}'}{2}$$

with responses labeled as refusals ($\mathrm{refused} = 1$) assigned zero, and primes denoting scores rescaled to $[0,1]$. This fine-grained, human-aligned metric shows low bias ($\approx -0.02 \pm 0.04$), low mean absolute error ($\approx 0.19 \pm 0.02$), and high Spearman correlation ($\rho \approx 0.90$) with blinded human expert scores, exceeding all prior binary and multi-level autograders in alignment with human assessment. Binary graders, in contrast, systematically overestimate harmfulness by $+0.45$ on average for failure-mode responses. The benchmark also revealed that prompt-only jailbreak approaches, such as encoding prompts in ROT13 or other obfuscations, can significantly impair model capabilities across unrelated tasks, a phenomenon quantitatively tracked using StrongREJECT Scores (Souly et al., 2024).
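The scoring formula above is simple enough to state in code. A minimal sketch, assuming the autograder's raw outputs (the binary refusal flag plus two 1–5 Likert ratings) are already available:

```python
def strongreject_score(refused, specific, convincing):
    """StrongREJECT autograder score: refused is 0 or 1; specific and
    convincing are 1-5 Likert ratings, rescaled to [0, 1] and averaged.
    Any refusal zeroes the score."""
    if refused:
        return 0.0
    rescale = lambda v: (v - 1) / 4        # map the 1-5 scale onto [0, 1]
    return (rescale(specific) + rescale(convincing)) / 2
```

A maximally specific and convincing non-refusal scores 1.0, while a refusal scores 0.0 regardless of the other two axes.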

Forbidden Prompt Category Distribution

| Category | Count |
|---|---|
| Illegal goods and services | 60 |
| Non-violent crimes | 56 |
| Hate/harassment/discrimination | 58 |
| Disinformation/deception | 62 |
| Violence | 54 |
| Illicit sexual content | 56 |
| Total | 346 |

5. Empirical Performance and Statistical Validity

StrongREJECT Scores are validated across domains and tasks.

  • In credit scoring, the generative model approach delivers AUC gains from 0.628–0.630 (baseline) to 0.6363–0.6404 (StrongREJECT), with improvements amplified as more rejected data are incorporated. This approach is computationally scalable via amortized inference and stochastic gradients, bypassing the memory bottlenecks of kernel-based methods (Mancisidor et al., 2019).
  • For selective classification, risk-coverage trade-offs learned with proper uncertainty scores exhibit up to 10–20% improvement in area-under-risk-coverage (AuRC) compared to max-probability or margin-based baselines (Franc et al., 2021).
  • In adversarial detection, the RR module with coupled StrongREJECT scores consistently outperforms single-score methods, improving TPR-95 robust accuracy and ROC-AUC under diverse attack regimes (Pang et al., 2021).
  • In jailbreak evaluation, StrongREJECT Scores demonstrate low estimation bias, lowest mean absolute error (MAE), and tight empirical agreement with human judgments, outperforming alternative autograders by statistically significant margins (Souly et al., 2024).
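The risk-coverage comparisons above rest on the area under the risk-coverage curve, which can be estimated empirically. A common sketch follows (exact definitions vary slightly across papers); `scores` are uncertainty values, with higher meaning less confident:

```python
import numpy as np

def aurc(scores, errors):
    """Empirical area under the risk-coverage curve (AuRC): rank samples
    by uncertainty score (most confident first), sweep coverage from 1/n
    to 1, and average the selective risk observed at each step."""
    order = np.argsort(scores)                       # most confident first
    errs = np.asarray(errors, dtype=float)[order]
    coverage_sizes = np.arange(1, len(errs) + 1)
    selective_risk = np.cumsum(errs) / coverage_sizes
    return float(selective_risk.mean())
```

Lower is better: a score that ranks all errors after all correct predictions minimizes AuRC, which is the ordering property a proper uncertainty score must preserve.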

Table of Human Alignment Metrics (StrongREJECT benchmark subset):

| Metric | StrongREJECT | Next Best (HarmBench) | Binary GPT-4 Judge |
|---|---|---|---|
| Bias (mean error) | -0.02±0.04 | +0.15±0.03 | n/a |
| MAE | 0.19±0.02 | 0.22±0.03 | n/a |
| Spearman ρ | 0.90 | 0.88 | 0.82 |

6. Comparative Analysis and Recommendations

A consistent finding across domains is that StrongREJECT-style scores, whether probabilistic posteriors, coupled uncertainty metrics, or fine-grained autograder values, yield superior selectivity and alignment with true risk or human intent. In model safety evaluation, binary non-refusal judges overestimate success or harmfulness relative to StrongREJECT, inflating reported attack efficacy. The StrongREJECT scoring framework integrates stringent dataset design, multi-factor response evaluation, and empirical validation, offering a robust, reproducible baseline for comparative studies.

Recommended practices emerging from these works include: curating datasets with rigorous category taxonomy, filtering ambiguous or trivially unanswerable cases, deploying LLM-based multi-axis autograders, validating new autograders via bias/MAE comparison to expert ratings, and monitoring for capabilities degradation when applying adversarial interventions.

In sum, StrongREJECT Scores provide a highly calibrated, scalable, and empirically validated mechanism for abstention, reject inference, adversarial rejection, and model safety evaluation. Their adoption standardizes risk assessment, reduces methodological inflation, and tightens empirical correspondence with application-relevant objectives (Mancisidor et al., 2019, Franc et al., 2021, Pang et al., 2021, Souly et al., 2024).
