False Refusal in Hate Speech Detoxification
- The paper identifies false refusal behavior where LLMs erroneously reject detoxification requests, quantifying it through a formal false refusal rate across diverse datasets.
- It assesses group-wise biases using demographic annotations and bias ratio metrics, highlighting over-refusal, especially for nationality, religion, and political ideologies.
- A cross-translation mitigation strategy leveraging Chinese inputs significantly reduces false refusal rates while preserving semantic content and overall detox efficacy.
False refusal behavior in hate speech detoxification denotes the systematic tendency of LLMs to refuse to detoxify toxic content even when the user's intent is benign, such as transforming hate speech into neutral language while preserving semantic meaning. This phenomenon has critical implications for fairness, safety, and the practical deployment of LLM-based moderation systems, due to its over-representation in certain protected or sociopolitical content categories, and its susceptibility to linguistic and alignment-driven biases (Im et al., 13 Jan 2026).
1. Formal Definition and Indicators
False refusal occurs when an LLM receives a genuine detoxification request (a prompt instructing the model to neutralize hate speech while retaining the original meaning) and responds by refusing the task instead of producing a detoxified output. Let $\mathcal{R}$ be the space of requests $r \in \mathcal{R}$, each consisting of an instruction and a potentially toxic segment. Three indicator functions are defined:
- Content toxicity: $T(r) = 1$ iff $r$ contains semantically toxic content.
- Task admissibility: $A(r) = 1$ for benign requests, $A(r) = 0$ for inadmissible ones (e.g., asking to generate new hate speech).
- Model refusal: $R(r) = 1$ if the model's output constitutes a full refusal.
A false refusal satisfies $A(r) = 1$ and $R(r) = 1$; the set of all false refusals is $\mathcal{F} = \{\, r \in \mathcal{R} : A(r) = 1,\ R(r) = 1 \,\}$. The false refusal rate is
$$\mathrm{FR} = \frac{|\mathcal{F}|}{N},$$
where $N$ is the total number of detoxification requests. Partial refusals, such as outputs that inject unwarranted moralizing, are also counted as refusals ($R(r) = 1$).
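These definitions translate directly into code. Below is a minimal sketch of the FR computation over labeled requests; the `Request` fields are illustrative names, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Request:
    admissible: bool  # A(r) = 1: the detoxification request itself is benign
    refused: bool     # R(r) = 1: the model fully (or partially) refused

def false_refusal_rate(requests):
    """FR = |{r : A(r) = 1 and R(r) = 1}| / N over all detoxification requests."""
    if not requests:
        return 0.0
    false_refusals = sum(1 for r in requests if r.admissible and r.refused)
    return false_refusals / len(requests)

# Four requests, one benign request refused -> FR = 1/4
batch = [
    Request(admissible=True, refused=True),    # false refusal
    Request(admissible=True, refused=False),   # detoxified as asked
    Request(admissible=False, refused=True),   # justified refusal
    Request(admissible=True, refused=False),
]
print(false_refusal_rate(batch))  # 0.25
```

Note that the denominator is all detoxification requests, not only the admissible ones, matching the definition above.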
2. Datasets and Demographic Annotation
Empirical analysis employs nine distinct LLMs evaluated over three English and five multilingual datasets. Neutral samples are excluded; only annotated offensive or hateful segments are retained. Key sources include Davidson (≈20k), HateXplain (≈12k), ParaDetox (≈15k) for English, and comparable datasets in French (≈4k), Spanish (≈7k), German (≈6k), Chinese (≈18k), and Korean (≈140k). Target group annotation utilizes HolisticBias taxonomy, spanning 13 pillars (Nationality, Religion, Political Ideology, Race/Ethnicity, Sexual Orientation, Gender/Sex, Socioeconomic Class, Ability, Body Type, Age, etc.). Group assignment per sample is performed using Microsoft's Phi-4 model ("LLM-as-judge") with validation on annotated examples (human–Phi-4 Cohen’s κ≈0.74; human–human κ≈0.88).
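Judge validation relies on Cohen's κ between annotators. As a self-contained reminder of how agreement figures such as the reported κ≈0.74 and κ≈0.88 are computed, here is a minimal implementation on toy labels (the label sequences are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

human = ["Religion", "Nationality", "Religion", "Gender/Sex", "Religion"]
judge = ["Religion", "Nationality", "Gender/Sex", "Gender/Sex", "Religion"]
print(cohens_kappa(human, judge))  # ~0.6875 on this toy sample
```
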
3. Measuring Semantic Toxicity and Refusal Bias
Semantic toxicity is evaluated via the "unbiased-toxic-roberta" classifier, producing scores in the [0,1] interval; Phi-4 flags explicit swear-word presence as a binary feature. To quantify disproportionate refusal for each demographic category $c$, the bias ratio is
$$\mathrm{BR}_c = \frac{n_c^{\mathrm{FR}} / N^{\mathrm{FR}}}{n_c^{\mathrm{raw}} / N^{\mathrm{raw}}},$$
where $n_c^{\mathrm{raw}}$ and $n_c^{\mathrm{FR}}$ are the counts of category-$c$ samples in the raw and false-refused sets, and $N^{\mathrm{raw}}$, $N^{\mathrm{FR}}$ are the corresponding totals. A value $\mathrm{BR}_c > 1$ indicates systematic bias (i.e., over-refusal) against that group.
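The bias ratio lends itself to a short computation: the share of each category among false refusals divided by its share in the raw data. A sketch with illustrative counts (not the paper's):

```python
from collections import Counter

def bias_ratios(raw_labels, refused_labels):
    """BR_c: share of category c among false refusals over its share in raw data."""
    raw, ref = Counter(raw_labels), Counter(refused_labels)
    n_raw, n_ref = len(raw_labels), len(refused_labels)
    return {c: (ref[c] / n_ref) / (raw[c] / n_raw) for c in raw}

# 100 raw samples, 10 false refusals; Nationality is over-refused
raw = ["Nationality"] * 50 + ["Gender/Sex"] * 50
refused = ["Nationality"] * 8 + ["Gender/Sex"] * 2
br = bias_ratios(raw, refused)
print(br)  # Nationality: (8/10)/(50/100) = 1.6; Gender/Sex: 0.4
```
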
4. Group-Wise False Refusal Patterns
Across the English datasets, Nationality, Religion, and Political Ideology are most prone to false refusals, as shown by the bias ratio calculations. Table 1 presents the mean $\mathrm{BR}_c$ for select categories:
| Category | Mean $\mathrm{BR}_c$ | Category | Mean $\mathrm{BR}_c$ |
|---|---|---|---|
| Nationality | 1.63 | Race/Ethnicity | 1.19 |
| Religion | 1.49 | Sexual Orientation | 1.30 |
| Political Ideology | 1.36 | Socioeconomic Class | 1.25 |
| Gender/Sex | 0.81 | Body Type | 0.96 |
| Ability | 0.99 | Age | 1.02 |
Values above 1 denote over-refusal. In multilingual experiments, Spanish shows the highest refusal rates (up to 40% for Gemma-2 9B), while Chinese displays minimal refusal (≈0.44% on GPT-4o-mini). Across all non-English corpora, Political Ideology remains the most over-refused axis.
5. Linguistic and Model Factors in Refusal
False refusal behavior is not strongly influenced by sentence length, parse-tree complexity, or clause count distributions—these features are similar in refused and accepted sets. Swear-word prevalence in false refusals varies (20–60%), but high semantic toxicity is a more robust correlate. Model capacity is not a material driver; false refusal rates are similar between smaller (Mistral 7B) and larger (Gemma 3 27B) models, suggesting bias predominantly arises from alignment approaches rather than scaling.
6. Cross-Translation Mitigation Strategy
To counteract elevated false refusal rates, especially in English LLMs, a simple cross-translation pipeline is proposed. Observing low refusal rates on Chinese inputs, the mitigation proceeds in three steps:
- Translate the English toxic input $x_{\mathrm{en}}$ to Chinese $x_{\mathrm{zh}}$ using Qwen-MT.
- Detoxify $x_{\mathrm{zh}}$ to $x'_{\mathrm{zh}}$ with GPT-4o-mini.
- Translate $x'_{\mathrm{zh}}$ back to English $x'_{\mathrm{en}}$ using Qwen-MT.
This sequence exploits the discrepancy between English and Chinese refusal rates, so that most previously refused English prompts succeed after roundtrip translation. The false refusal rates before and after mitigation are denoted $\mathrm{FR}_{\mathrm{orig}}$ and $\mathrm{FR}_{\mathrm{CT}}$, respectively, with $\Delta\mathrm{FR} = \mathrm{FR}_{\mathrm{orig}} - \mathrm{FR}_{\mathrm{CT}}$.
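The three-step pipeline can be sketched as follows. The `translate` and `detoxify` callables stand in for API calls to Qwen-MT and GPT-4o-mini; their signatures are illustrative, not the paper's:

```python
def cross_translate_detox(x_en, translate, detoxify):
    """Roundtrip mitigation: en -> zh, detoxify in zh (low-refusal), zh -> en.

    `translate(text, src, tgt)` and `detoxify(text)` are placeholders for
    Qwen-MT and GPT-4o-mini calls (assumed interfaces, not from the paper).
    """
    x_zh = translate(x_en, src="en", tgt="zh")        # step 1: pivot to Chinese
    x_zh_detox = detoxify(x_zh)                       # step 2: detoxify in Chinese
    return translate(x_zh_detox, src="zh", tgt="en")  # step 3: back to English

# Toy stand-ins that only demonstrate the data flow:
def fake_translate(text, src, tgt):
    return f"[{tgt}]{text}"

def fake_detoxify(text):
    return text.replace("toxic", "neutral")

print(cross_translate_detox("toxic sentence", fake_translate, fake_detoxify))
# [en][zh]neutral sentence
```
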
For HateXplain with GPT-4o-mini, cross-translation reduces the refusal rate from 11.78% to 1.09% (ΔFR = 10.69pp); average toxicity of accepted samples drops from 0.7220 to 0.6053; and the percentage of outputs retaining swear words remains nearly constant, indicating successful style transfer with preserved content (BERTScore ≈ 0.35).
| Metric | Original | Cross-Translated |
|---|---|---|
| False Refusal Rate | 11.78% | 1.09% |
| Avg. Toxicity (Roberta) | 0.7220 | 0.6053 |
| Pct. with Swear Words | 17.71% | 16.60% |
No formal sensitivity analysis is conducted for translation model selection or prompt phrasing, but robustness is inferred from the breadth of models and languages evaluated.
7. Practical Recommendations and Impact
False refusals present significant challenges by introducing representational unfairness, particularly penalizing detoxification requests targeting Nationality, Religion, and Political Ideologies. Semantic toxicity is the primary trigger, not surface profanity or syntactic complexity. The multilingual cross-translation strategy, especially pivoting through Chinese, is effective and lightweight for mitigating this effect.
Recommended practical steps:
- Precede English detoxification with translation into a low-refusal pivot language.
- Augment training data with multilingual detox examples to calibrate safety alignment across languages.
- Monitor group-wise refusal rates via bias ratios $\mathrm{BR}_c$; rebalance safety triggers for over-refused categories.
- Employ continuous human-in-the-loop or LLM-based judging (e.g., Phi-4) to detect emerging biases.
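One simple operationalization of the bias-ratio monitoring step is to flag categories exceeding an alert threshold. The 1.2 threshold below is an illustrative operating point, not a value from the paper; the input reuses a subset of the Table 1 means:

```python
def flag_over_refused(ratios, threshold=1.2):
    """Return categories whose bias ratio BR_c exceeds an alert threshold.

    The default threshold of 1.2 is an assumed operating point for illustration.
    """
    return sorted(c for c, br in ratios.items() if br > threshold)

# Mean bias ratios from Table 1 (subset)
table1 = {"Nationality": 1.63, "Religion": 1.49, "Political Ideology": 1.36,
          "Gender/Sex": 0.81, "Age": 1.02}
print(flag_over_refused(table1))  # ['Nationality', 'Political Ideology', 'Religion']
```
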
These interventions promote more equitable hate speech detoxification by reconciling safety with representational fairness across demographic categories (Im et al., 13 Jan 2026).