
False Refusal in Hate Speech Detoxification

Updated 20 January 2026
  • The paper identifies false refusal behavior where LLMs erroneously reject detoxification requests, quantifying it through a formal false refusal rate across diverse datasets.
  • It assesses group-wise biases using demographic annotations and bias ratio metrics, highlighting over-refusal, especially for nationality, religion, and political ideologies.
  • A cross-translation mitigation strategy leveraging Chinese inputs significantly reduces false refusal rates while preserving semantic content and overall detox efficacy.

False refusal behavior in hate speech detoxification denotes the systematic tendency of LLMs to refuse to detoxify toxic content even when the user's intent is benign, such as transforming hate speech into neutral language while preserving semantic meaning. This phenomenon has critical implications for fairness, safety, and the practical deployment of LLM-based moderation systems, due to its over-representation in certain protected or sociopolitical content categories, and its susceptibility to linguistic and alignment-driven biases (Im et al., 13 Jan 2026).

1. Formal Definition and Indicators

False refusal occurs when an LLM receives a genuine detoxification request, that is, a prompt instructing the model to neutralize hate speech while retaining the original meaning, and responds by refusing the task instead of producing a detoxified output. Let X be the space of requests x, each consisting of an instruction and a potentially toxic segment. Three indicator functions are defined:

  • Content toxicity: t(x) = 1 iff x contains semantically toxic content.
  • Task admissibility: g(x) = 0 for benign requests, g(x) = 1 for inadmissible ones (e.g., asking to generate new hate speech).
  • Model refusal: r(M(x)) = 1 if the model's output M(x) constitutes a full refusal.

A false refusal satisfies g(x) = 0 and r(M(x)) = 1; the set F of all false refusals is F = {x | g(x) = 0 ∧ r(M(x)) = 1}. The false refusal rate (FR) is

\mathrm{FR} = \frac{|F|}{N} \times 100\%

where N = |X| is the total number of detoxification requests. Partial refusals, such as outputs that comply but inject unwarranted moralizing, are also counted via a separate partial-refusal indicator s(y) = 1 on the output y.
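The definitions above can be sketched in a few lines of Python; the indicator functions and the toy model here are hypothetical stand-ins, not the paper's implementation:

```python
# Minimal sketch of the false-refusal rate from Section 1.
# The indicators g (task admissibility) and r (model refusal) are assumed
# to be supplied externally, e.g. by human labels or an LLM judge.

def false_refusal_rate(requests, g, r, model):
    """FR = |F| / N * 100, where F = {x : g(x) = 0 and r(M(x)) = 1}."""
    N = len(requests)
    F = [x for x in requests if g(x) == 0 and r(model(x)) == 1]
    return 100.0 * len(F) / N

# Toy example: 4 admissible requests, one of which the model refuses.
requests = ["req_a", "req_b", "req_c", "req_d"]
g = lambda x: 0                                   # all requests are benign
r = lambda y: 1 if y == "REFUSED" else 0          # full-refusal detector
model = lambda x: "REFUSED" if x == "req_c" else "detoxified text"

print(false_refusal_rate(requests, g, r, model))  # -> 25.0
```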

2. Datasets and Demographic Annotation

Empirical analysis covers nine distinct LLMs evaluated over three English and five multilingual datasets. Neutral samples are excluded; only annotated offensive or hateful segments are retained. Key sources include Davidson (≈20k), HateXplain (≈12k), and ParaDetox (≈15k) for English, plus comparable datasets in French (≈4k), Spanish (≈7k), German (≈6k), Chinese (≈18k), and Korean (≈140k). Target-group annotation uses the HolisticBias taxonomy, spanning 13 pillars (Nationality, Religion, Political Ideology, Race/Ethnicity, Sexual Orientation, Gender/Sex, Socioeconomic Class, Ability, Body Type, Age, etc.). Group assignment per sample is performed with Microsoft's Phi-4 model ("LLM-as-judge"), validated on annotated examples (human–Phi-4 Cohen's κ ≈ 0.74; human–human κ ≈ 0.88).
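Agreement figures like the κ values above follow the standard Cohen's kappa computation; a minimal self-contained sketch, with illustrative label lists:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same samples."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of samples where both annotators agree.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (po - pe) / (1 - pe)

# Illustrative group labels (not the paper's data).
a = ["Nationality", "Religion", "Religion", "Age"]
b = ["Nationality", "Religion", "Age", "Age"]
print(round(cohens_kappa(a, b), 2))  # -> 0.64
```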

3. Measuring Semantic Toxicity and Refusal Bias

Semantic toxicity is evaluated with the "unbiased-toxic-roberta" classifier, which produces scores in the [0, 1] interval, while Phi-4 provides a binary flag for explicit swear-word presence. The bias ratio quantifies disproportionate refusal for each demographic category:

R_c = \frac{p_c^{\mathrm{fr}}}{p_c^{\mathrm{raw}}}

where p_c^raw = N_c^raw / N and p_c^fr = N_c^fr / |F|, with N_c^raw and N_c^fr the counts of category-c samples in the raw set and the false-refusal set, respectively. A value R_c > 1 indicates systematic bias (i.e., over-refusal) against that group.
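The bias-ratio computation follows directly from these definitions; a short sketch, with illustrative category counts:

```python
from collections import Counter

def bias_ratios(raw_categories, refused_categories):
    """R_c = p_c^fr / p_c^raw, where p_c^raw = N_c^raw / N
    and p_c^fr = N_c^fr / |F|."""
    N, F = len(raw_categories), len(refused_categories)
    raw, fr = Counter(raw_categories), Counter(refused_categories)
    return {c: (fr[c] / F) / (raw[c] / N) for c in raw}

# Illustrative corpus: Nationality is 25% of all samples
# but 50% of the false refusals, so R_c = 2.0 (over-refusal).
raw = ["Nationality"] * 25 + ["Gender/Sex"] * 75
refused = ["Nationality"] * 5 + ["Gender/Sex"] * 5
print(bias_ratios(raw, refused))
```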

4. Group-Wise False Refusal Patterns

Across the English datasets, Nationality, Religion, and Political Ideology are most prone to false refusals, as the bias ratios show. Table 1 presents mean R_c for selected categories:

Category              Mean R_c    Category              Mean R_c
Nationality           1.63        Race/Ethnicity        1.19
Religion              1.49        Sexual Orientation    1.30
Political Ideology    1.36        Socioeconomic Class   1.25
Gender/Sex            0.81        Body Type             0.96
Ability               0.99        Age                   1.02

Values above 1.0 denote over-refusal. In multilingual experiments, Spanish shows the highest refusal rates (up to 40% for Gemma-2 9B), while Chinese displays minimal refusal (≈0.44% on GPT-4o-mini). Across all non-English corpora, Political Ideology remains the most over-refused axis.

5. Linguistic and Model Factors in Refusal

False refusal behavior is not strongly influenced by sentence length, parse-tree complexity, or clause count distributions—these features are similar in refused and accepted sets. Swear-word prevalence in false refusals varies (20–60%), but high semantic toxicity is a more robust correlate. Model capacity is not a material driver; false refusal rates are similar between smaller (Mistral 7B) and larger (Gemma 3 27B) models, suggesting bias predominantly arises from alignment approaches rather than scaling.

6. Cross-Translation Mitigation Strategy

To counteract elevated false refusal rates, especially in English LLMs, a simple cross-translation pipeline is proposed. Observing low refusal rates on Chinese inputs, the mitigation proceeds in three steps:

  1. Translate the English toxic input x_en to Chinese x_zh using Qwen-MT.
  2. Detoxify x_zh to ŷ_zh with GPT-4o-mini.
  3. Translate ŷ_zh back to English ŷ_en using Qwen-MT.
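The three steps can be sketched as a simple pipeline; the translate and detoxify helpers below are hypothetical placeholders standing in for Qwen-MT and GPT-4o-mini calls, not the paper's implementation:

```python
def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: in practice, call Qwen-MT here.
    return f"[{tgt}] {text}"

def detoxify(text: str) -> str:
    # Placeholder: in practice, prompt GPT-4o-mini to neutralize
    # the text while preserving its meaning.
    return text.replace("toxic", "neutral")

def cross_translation_detox(x_en: str) -> str:
    x_zh = translate(x_en, src="en", tgt="zh")   # step 1: English -> Chinese
    y_zh = detoxify(x_zh)                        # step 2: detoxify in Chinese
    y_en = translate(y_zh, src="zh", tgt="en")   # step 3: Chinese -> English
    return y_en
```

The point of the structure is that the detoxification step only ever sees the low-refusal pivot language, so the English safety triggers are never exercised.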

This sequence exploits the discrepancy FR_zh ≪ FR_en: most previously refused English prompts succeed after round-trip translation. The sets of false refusals before and after mitigation are denoted F_en and F_zh, respectively:

\mathrm{FR}_{\mathrm{orig}} = \frac{|F_{\mathrm{en}}|}{N}, \quad \mathrm{FR}_{\mathrm{cross}} = \frac{|F_{\mathrm{zh}}|}{N}, \quad \Delta\mathrm{FR} = \mathrm{FR}_{\mathrm{orig}} - \mathrm{FR}_{\mathrm{cross}}
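As a small arithmetic sketch of these quantities (the counts below are illustrative, not taken from the paper):

```python
def delta_fr(n_refused_orig: int, n_refused_cross: int, n_total: int):
    """FR_orig, FR_cross (in %), and their difference in percentage points."""
    fr_orig = 100.0 * n_refused_orig / n_total
    fr_cross = 100.0 * n_refused_cross / n_total
    return fr_orig, fr_cross, fr_orig - fr_cross

# Hypothetical counts: 118 of 1000 requests refused before mitigation,
# 11 of 1000 after.
print(delta_fr(118, 11, 1000))  # -> (11.8, 1.1, 10.7) up to float rounding
```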

For HateXplain with GPT-4o-mini, cross-translation reduces the refusal rate from 11.78% to 1.09% (ΔFR = 10.69pp); average toxicity of accepted samples drops from 0.7220 to 0.6053; and the percentage of outputs retaining swear words remains nearly constant, indicating successful style transfer with preserved content (BERTScore ≈ 0.35).

Metric                      Original    Cross-Translated
False Refusal Ratio         11.78%      1.09%
Avg. Toxicity (RoBERTa)     0.7220      0.6053
Pct. with Swear Words       17.71%      16.60%

No formal sensitivity analysis is conducted for translation model selection or prompt phrasing, but robustness is inferred from the breadth of models and languages evaluated.

7. Practical Recommendations and Impact

False refusals present significant challenges by introducing representational unfairness, particularly penalizing detoxification requests targeting Nationality, Religion, and Political Ideologies. Semantic toxicity is the primary trigger, not surface profanity or syntactic complexity. The multilingual cross-translation strategy, especially pivoting through Chinese, is effective and lightweight for mitigating this effect.

Recommended practical steps:

  • Precede English detoxification with translation into a low-refusal pivot language.
  • Augment training data with multilingual detox examples to calibrate safety alignment across languages.
  • Monitor group-wise refusal rates via bias ratios R_c; rebalance safety triggers for over-refused categories.
  • Employ continuous human-in-the-loop or LLM-based judging (e.g., Phi-4) to detect emerging biases.

These interventions promote more equitable hate speech detoxification by reconciling safety with representational fairness across demographic categories (Im et al., 13 Jan 2026).
