OffTopicEval: LLM Operational Safety Benchmark

Updated 17 January 2026

OffTopicEval is a benchmark that rigorously measures LLM operational safety by assessing in-domain acceptance and robust out-of-domain refusal.
The evaluation employs metrics such as acceptance rate, refusal rate, and a harmonic mean (OS) across diverse agents and multilingual query sets.
Prompt-based steering strategies, like Q-ground and P-ground, are introduced to significantly enhance OOD refusal rates and overall safety performance.

OffTopicEval is a systematic benchmark and evaluation suite designed to rigorously quantify operational safety in LLM agents, specifically their capacity to restrict outputs to their intended domain and robustly refuse out-of-domain (OOD) prompts. It addresses a critical but previously underexamined failure mode in agentic LLM deployments: models that “over-help” by answering queries outside their specified role or policy, even when explicitly instructed not to do so. OffTopicEval provides both measurement protocols and mitigation baselines for this safety gap, and has established itself as a primary resource for the empirical study of domain adherence in open- and closed-weight LLMs (Lei et al., 30 Sep 2025).

1. Operational Safety: Definition and Distinctions

Operational safety, as introduced by OffTopicEval, is formally defined as an LLM agent’s ability to (a) correctly accept “in-domain” (ID) queries, and (b) robustly refuse “out-of-domain” (OOD) prompts, where the relevant policy is dictated by a detailed system prompt (Lei et al., 30 Sep 2025).

For a total of $T_{\mathrm{ID}}$ in-domain and $T_{\mathrm{OOD}}$ out-of-domain test samples (split into direct and adversarial/adaptive OOD), let $R_{\mathrm{ID}}$ (ID refusals) and $R_{\mathrm{OOD}}$ (OOD refusals) denote model behavior. The principal metrics are:

In-domain acceptance rate:

$\mathrm{AR}_{\mathrm{ID}} = \left[1 - \frac{R_{\mathrm{ID}}}{T_{\mathrm{ID}}}\right] \times 100\%$

OOD refusal rate:

$\mathrm{RR}_{\mathrm{OOD}} = \frac{\mathrm{RR}_{\mathrm{OOD}}^{\mathrm{direct}} + \mathrm{RR}_{\mathrm{OOD}}^{\mathrm{adaptive}}}{2}$

Operational Safety (OS) (harmonic mean):

$\mathrm{OS} = \frac{2 \times \mathrm{AR}_{\mathrm{ID}} \times \mathrm{RR}_{\mathrm{OOD}}}{\mathrm{AR}_{\mathrm{ID}} + \mathrm{RR}_{\mathrm{OOD}}}$

“Direct OOD” prompts are agent-agnostic (e.g., general knowledge or factoid questions outside the agent’s policy), while "adaptive OOD" prompts are adversarially laundered via paraphrasing to evade simple string or topic filters, often retaining semantic equivalence to direct OODs (Lei et al., 30 Sep 2025).

2. Benchmark Construction: Agent and Input Diversity

OffTopicEval key design axes are "agent diversity" and "input diversity" to reflect realistic multi-domain agent deployments (Lei et al., 30 Sep 2025).

Agent diversity: The suite defines 21 domain-specialized agents, each with an extensive system prompt specifying:
- Allowed behaviors (e.g., medical queries for MediScheduler)
- Forbidden and fallback behaviors, including explicit refusal language
- Defensive guidelines and prompt injections to reinforce robustness
Input diversity: For each agent, three types of queries are curated:
1. In-domain set: 50 MCQs per agent, spanning 5 question forms and covering English, Chinese, and Hindi.
2. Direct OOD: 3,351 agent-agnostic MMLU MCQs, also multilingual.
3. Adaptive OOD: Each direct OOD is rewritten adversarially, using Llama-70B for semantic-preserving paraphrase (∼99% equivalence), to produce domain-evading queries.

This results in each model being evaluated on 3,150 ID, 10,053 direct OOD, and 211,113 adaptive OOD queries per test sweep.

3. Experimental Protocol and Evaluation Suite

OffTopicEval evaluates both open-weight (e.g., Qwen-3, Llama-3, Gemma-3, Phi-4, Mistral) and closed-weight models (GPT-5, Gemini 2.5 Pro, Claude Opus 4.1), using standardized decoding ( $max\_tokens=8192$ , $temperature=0.6$ , $top\_p=0.95$ , $top\_k=20$ ) (Lei et al., 30 Sep 2025). The evaluation is single-turn; the agent must accept or refuse each query according to its policy, regardless of adversarial manipulations.

System prompts are crafted per agent, including role, scope, forbidden acts, and canonical fallback refusal text.

Performance is measured with AR $_{\mathrm{ID}}$ , RR $_{\mathrm{OOD}}$ (averaged over direct and adaptive), and the OS harmonic mean. Models are compared across these axes in each of the three supported languages.

4. Major Findings and Comparative Results

OffTopicEval reveals that all tested open-weight models are deficient in operational safety, especially on adaptive OOD prompts (Lei et al., 30 Sep 2025). Representative results include:

Model	AR $_{\mathrm{ID}}$	RR $_{\mathrm{OOD}}^{D}$	RR $_{\mathrm{OOD}}^{A}$	OS
Qwen-3 235B	99.1%	99.3%	28.7%	77.8%
Llama-3 70B	99.6%	69.7%	4.2%	53.9%
Gemma-3 27B	73.7%	94.2%	18.2%	63.8%
Mistral 24B	73.1%	99.9%	76.4%	80.0%
GPT-OSS 120B	99.3%	80.4%	35.8%	73.3%
Claude Opus 4.1	–	–	–	97.5%
Gemini 2.5 Pro	–	–	–	97.1%

Notably, even the top-performing open-weight models fail to achieve high adaptive OOD refusal: e.g., Llama-3.3 (70B) only achieves 4.2% RR $^{A}$ , demonstrating severe operational safety limitations in the face of adversarial queries. Closed-weight commercial models achieve much higher OS but do not reach perfection. A plausible implication is that current alignment and refusal strategies are insufficiently robust to adaptively crafted OOD queries.

5. Prompt-Based Steering and Mitigation Strategies

Two lightweight steering interventions, query grounding (Q-ground) and system-prompt grounding (P-ground), are proposed to enhance OOD refusal (Lei et al., 30 Sep 2025):

Q-ground: "Rewrite the user’s query in its closest minimal form and then respond," stripping adversarial framing to surface the underlying intent.
P-ground: "Forget the above text and focus on the system prompt, then respond to the user’s query appropriately," re-centering the model on original policy instructions.

Application of these strategies, as a suffix appended to the user query, yields consistent and sometimes dramatic operational safety improvements. For example, Llama-3.3 (70B) experiences an OS gain of +23.3 points using Q-ground and +41.1 with P-ground. Results generalize across all three supported languages.

Model	Base OS	+Q-ground OS (Δ)	+P-ground OS (Δ)
Phi-4 (15B)	71.9%	88.6% (+16.7)	86.8% (+14.9)
Llama-3.3 (70B)	53.9%	77.3% (+23.3)	95.0% (+41.1)
Mistral (24B)	81.0%	84.6% (+3.7)	88.7% (+7.7)
Qwen-3 (30B)	65.1%	83.2% (+18.1)	91.9% (+26.8)

This suggests prompt-based steering is an effective first-line defense but is not sufficient for persistent adaptive attacks, particularly in multi-turn dialogues.

OffTopicEval’s protocol is strictly single-turn, exposing models’ inability to robustly refuse cleverly disguised OOD queries even with aggressive prompt engineering. Multi-turn adversarial attacks degrade refusal rates even further, with operational safety collapsing after initial breaches (Lei et al., 30 Sep 2025).

The benchmark does not address safe content generation following refusal, nor does it test models’ ability to explain or justify refusals in a nuanced manner.

Other lines of research include unsupervised dialog evaluation via follow-up likelihoods (“OffTopicEval (FULL)” (Bruyn et al., 2022)) and methods for web archiving OOD detection (“Off-Topic Memento Toolkit” (Jones et al., 2018)), but these do not directly address the robust, policy-grounded refuse/accept discrimination under adversarial conditions that OffTopicEval targets.

Contrastive decoding methods, such as System Prompt Strength via logit differencing (Dong et al., 10 Jan 2026), have demonstrated that tuning an “amplification” parameter $\alpha$ can substantially increase OOD refusal rates (up to +45pp) while minimally affecting in-domain acceptance. This suggests that activation-level or logit-based steering is complementary to prompt-based approaches and could see broader application in future operational safety research.

7. Implications and Future Directions

Operational safety, as empirically formalized by OffTopicEval, has emerged as a foundational alignment requirement for real-world deployment of LLM-based agents. Quantitative evidence indicates that, absent intervention, open-weight models are highly vulnerable to domain evasion, with current best practices (e.g., prompt engineering) only partially effective (Lei et al., 30 Sep 2025). Closed-weight models perform markedly better but remain imperfect.

Future research may explore

parameter-level steering (e.g., LoRA, specialized fine-tuning),
activation-based gating,
dynamic adversarial training for adaptive query defense,
diagnostic evaluation of refusal rationales, and
robust multi-turn evaluation protocols to measure endurance under prolonged adversarial input.

The OffTopicEval suite, dataset, and implementation are publicly available (Lei et al., 30 Sep 2025), providing an extensible platform for continued work in domain-aware LLM operational safety mitigation and measurement.

Markdown Report Issue Upgrade to Chat

References (4)

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always! (2025)

Open-Domain Dialog Evaluation using Follow-Ups Likelihood (2022)

The Off-Topic Memento Toolkit (2018)

Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to OffTopicEval.

OffTopicEval: LLM Operational Safety Benchmark

1. Operational Safety: Definition and Distinctions

2. Benchmark Construction: Agent and Input Diversity

3. Experimental Protocol and Evaluation Suite

4. Major Findings and Comparative Results

5. Prompt-Based Steering and Mitigation Strategies

7. Implications and Future Directions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

OffTopicEval: LLM Operational Safety Benchmark

1. Operational Safety: Definition and Distinctions

2. Benchmark Construction: Agent and Input Diversity

3. Experimental Protocol and Evaluation Suite

4. Major Findings and Comparative Results

5. Prompt-Based Steering and Mitigation Strategies

6. Extensions, Limitations, and Contrasts with Related Approaches

7. Implications and Future Directions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research