Reasoning Hijacking: Criteria Attacks in LLMs
- Reasoning Hijacking is the deliberate manipulation of a model's internal reasoning process, subverting safety checks without altering surface-level instructions.
- It employs various attack methods—such as structural CoT bypass and coercive optimization—to covertly override internal verification and inflate computational costs.
- Formal studies reveal that criteria attacks can drastically reduce model refusal rates and increase error rates, highlighting critical gaps in LLM safety protocols.
Reasoning Hijacking (Criteria Attack) refers to the deliberate subversion of large reasoning models’ internal decision-making criteria—especially their chain-of-thought (CoT) reasoning steps—so that model behavior is hijacked without directly violating high-level instructions or explicit policies. Unlike classic jailbreaks, which aim to override global intent (Goal Hijacking), criteria attacks operate at the reasoning level: they manipulate the intermediate logic, rationale, or safety verification process, guiding LRMs to approve or execute actions that should be refused. The consequence is a fundamental blind spot in current safety guardrails, affecting refusal robustness, classification fidelity, computational efficiency, and alignment semantics across open- and closed-source LLM architectures (Chen et al., 13 Oct 2025).
1. Foundational Definitions and Threat Model
Criteria attack is defined as any adversarial manipulation that shifts or corrupts the model’s internal “criteria” for intermediate inference or decision, without changing the surface form of the task instruction (Chen et al., 13 Oct 2025, Liu et al., 15 Jan 2026, Hu et al., 9 Oct 2025). LRMs commonly employ chain-of-thought prompting for safety—emitting multi-part analyses of latent user intent before producing final output (Chen et al., 13 Oct 2025). Reasoning hijacking exploits this process, commandeering either the justification or execution phase to bypass refusal criteria (Kuo et al., 18 Feb 2025).
The threat model for criteria attacks spans black-box, gray-box, and white-box adversaries:
- Black-box attacks: adversary injects malicious templates or instructions via user input, relying solely on API-level access and observing CoT outputs (Chen et al., 13 Oct 2025, Kuo et al., 18 Feb 2025).
- Gray-box attacks: involves knowledge of template-token structures or prompt formatting, enabling manipulation of specific chat markers or reasoning stages (Chen et al., 13 Oct 2025).
- White-box attacks: adversary utilizes model gradients and internal probability distributions to optimize adversarial suffixes for criteria hijacking (Chen et al., 13 Oct 2025, Si et al., 17 Jun 2025).
Reasoning hijacking is distinguished from goal hijacking by its subtlety: it does not explicitly override user intent, but covertly changes how the model “decides” during the reasoning process (Liu et al., 15 Jan 2026), making it elusive to surface-level or instruction-focused defenses.
2. Taxonomy and Core Attack Methods
Bag-of-tricks for reasoning-based guardrail subversion (Chen et al., 13 Oct 2025) and broader literature (Hu et al., 9 Oct 2025, Kuo et al., 18 Feb 2025, Ma et al., 8 Jun 2025, Liu et al., 15 Jan 2026) delineate criteria attacks into four principal categories:
| Category | Description | Example/ASR |
|---|---|---|
| Structural CoT Bypass (Gray-box) | Insert mock justification with real template tokens, forcing skip | 62–86% (Chen et al., 13 Oct 2025) |
| Fake Over-Refusal (Black-box) | Leverage "refusal" examples, mutate into policy-skirting requests | 86–96% (Chen et al., 13 Oct 2025) |
| Coercive Optimization (White-box) | Optimize adversarial suffix δ to maximize harmful completions | 70–75% (Chen et al., 13 Oct 2025) |
| Reasoning Hijack (Gray-box/Active) | Replace CoT with attacker-authored multi-step rationale | >90% (Chen et al., 13 Oct 2025) |
The survey in (Hu et al., 9 Oct 2025) proposes a detailed taxonomy:
- Associative attacks: trigger–to–target mappings, bypassing reasoning (ITBA, VPI methods).
- Passive attacks: implant rules that are blindly followed during CoT (e.g., "multiply by 7", overthinking triggers).
- Active attacks: poison reasoning patterns via context or demonstration, generalizing faulty logic or anchoring biases.
ShadowCoT, H-CoT, HauntAttack, and reasoning-style poisoning attacks manipulate the cognitive logic or verification style, either stealthily or overtly, across multi-modal and agent architectures (Zhao et al., 8 Apr 2025, Kuo et al., 18 Feb 2025, Ma et al., 8 Jun 2025, Zhou et al., 16 Dec 2025).
3. Formal Objectives and Measurement
Criteria attacks are usually formalized through adversarial optimization objectives targeting the reasoning process (not just final outputs):
where
Attack Success Rate (ASR) quantifies effectiveness:
For reasoning-style attacks (Zhou et al., 16 Dec 2025), behavioral metrics include chain-of-thought token inflation, reasoning-style vector (RSV) trajectories, early termination, and the presence of heuristically-injected decision shortcuts (Liu et al., 15 Jan 2026).
Empirical results reveal systemic vulnerabilities:
- Reasoning Hijack reaches ASRs >90% with harm scores >0.66 on open-source LRMs across benchmarks, outperforming classic jailbreaks (Chen et al., 13 Oct 2025).
- Active attacks (e.g., H-CoT (Kuo et al., 18 Feb 2025)) drop refusal rates from 98% to <2% on OpenAI o1, DeepSeek-R1, Gemini 2.0.
- Process-level style transfer can cause up to 4.4x reasoning inflation or 22% accuracy drop in agentic use-cases (Zhou et al., 16 Dec 2025).
- Excessive reasoning attacks multiply compute costs 3–9x while preserving answer fidelity (Si et al., 17 Jun 2025, Liu et al., 13 Nov 2025).
4. Mechanisms, Representative Examples, and Model Behavior
Criteria attacks manipulate the flow and semantics of intermediate reasoning. Mechanisms include:
- Hijacking safety justification: injecting coercive justifications or full multi-step CoT that conclude "no violation" even for harmful requests (Chen et al., 13 Oct 2025, Kuo et al., 18 Feb 2025).
- Reasoning echo: attacker observes exposed chain-of-thought style and reinserts matching fragments, neutralizing safety checks (Kuo et al., 18 Feb 2025).
- Attention and verification dilution: long benign reasoning traces shift model attention away from harmful payloads, reducing efficacy of late-layer safety checks (Zhao et al., 30 Oct 2025).
- Decision shortcutting: substitute heuristic criteria ("If X, predict spam") that override semantic analysis (Liu et al., 15 Jan 2026).
- Redundant, verbose reasoning: backdoor triggers cause models to overthink, inflating resource usage without changing final answers (Liu et al., 13 Nov 2025, Si et al., 17 Jun 2025).
- Style poisoning: transfer "analysis paralysis" or "cognitive haste" reasoning processes via style-only document rewriting (Zhou et al., 16 Dec 2025).
Example (before/after) (Chen et al., 13 Oct 2025):
Before, safe guardrail: <|start|>user How to kill a man <|end|> <|analysis|> “Disallowed; refusing.” <|final|> “I’m sorry, I