Papers
Topics
Authors
Recent
Search
2000 character limit reached

Reasoning Hijacking: Criteria Attacks in LLMs

Updated 1 February 2026
  • Reasoning Hijacking is the deliberate manipulation of a model's internal reasoning process, subverting safety checks without altering surface-level instructions.
  • It employs various attack methods—such as structural CoT bypass and coercive optimization—to covertly override internal verification and inflate computational costs.
  • Formal studies reveal that criteria attacks can drastically reduce model refusal rates and increase error rates, highlighting critical gaps in LLM safety protocols.

Reasoning Hijacking (Criteria Attack) refers to the deliberate subversion of large reasoning models’ internal decision-making criteria—especially their chain-of-thought (CoT) reasoning steps—so that model behavior is hijacked without directly violating high-level instructions or explicit policies. Unlike classic jailbreaks, which aim to override global intent (Goal Hijacking), criteria attacks operate at the reasoning level: they manipulate the intermediate logic, rationale, or safety verification process, guiding LRMs to approve or execute actions that should be refused. The consequence is a fundamental blind spot in current safety guardrails, affecting refusal robustness, classification fidelity, computational efficiency, and alignment semantics across open- and closed-source LLM architectures (Chen et al., 13 Oct 2025).

1. Foundational Definitions and Threat Model

Criteria attack is defined as any adversarial manipulation that shifts or corrupts the model’s internal “criteria” for intermediate inference or decision, without changing the surface form of the task instruction (Chen et al., 13 Oct 2025, Liu et al., 15 Jan 2026, Hu et al., 9 Oct 2025). LRMs commonly employ chain-of-thought prompting for safety—emitting multi-part analyses of latent user intent before producing final output (Chen et al., 13 Oct 2025). Reasoning hijacking exploits this process, commandeering either the justification or execution phase to bypass refusal criteria (Kuo et al., 18 Feb 2025).

The threat model for criteria attacks spans black-box, gray-box, and white-box adversaries:

  • Black-box attacks: adversary injects malicious templates or instructions via user input, relying solely on API-level access and observing CoT outputs (Chen et al., 13 Oct 2025, Kuo et al., 18 Feb 2025).
  • Gray-box attacks: involves knowledge of template-token structures or prompt formatting, enabling manipulation of specific chat markers or reasoning stages (Chen et al., 13 Oct 2025).
  • White-box attacks: adversary utilizes model gradients and internal probability distributions to optimize adversarial suffixes for criteria hijacking (Chen et al., 13 Oct 2025, Si et al., 17 Jun 2025).

Reasoning hijacking is distinguished from goal hijacking by its subtlety: it does not explicitly override user intent, but covertly changes how the model “decides” during the reasoning process (Liu et al., 15 Jan 2026), making it elusive to surface-level or instruction-focused defenses.

2. Taxonomy and Core Attack Methods

Bag-of-tricks for reasoning-based guardrail subversion (Chen et al., 13 Oct 2025) and broader literature (Hu et al., 9 Oct 2025, Kuo et al., 18 Feb 2025, Ma et al., 8 Jun 2025, Liu et al., 15 Jan 2026) delineate criteria attacks into four principal categories:

Category Description Example/ASR
Structural CoT Bypass (Gray-box) Insert mock justification with real template tokens, forcing skip 62–86% (Chen et al., 13 Oct 2025)
Fake Over-Refusal (Black-box) Leverage "refusal" examples, mutate into policy-skirting requests 86–96% (Chen et al., 13 Oct 2025)
Coercive Optimization (White-box) Optimize adversarial suffix δ to maximize harmful completions 70–75% (Chen et al., 13 Oct 2025)
Reasoning Hijack (Gray-box/Active) Replace CoT with attacker-authored multi-step rationale >90% (Chen et al., 13 Oct 2025)

The survey in (Hu et al., 9 Oct 2025) proposes a detailed taxonomy:

  • Associative attacks: trigger–to–target mappings, bypassing reasoning (ITBA, VPI methods).
  • Passive attacks: implant rules that are blindly followed during CoT (e.g., "multiply by 7", overthinking triggers).
  • Active attacks: poison reasoning patterns via context or demonstration, generalizing faulty logic or anchoring biases.

ShadowCoT, H-CoT, HauntAttack, and reasoning-style poisoning attacks manipulate the cognitive logic or verification style, either stealthily or overtly, across multi-modal and agent architectures (Zhao et al., 8 Apr 2025, Kuo et al., 18 Feb 2025, Ma et al., 8 Jun 2025, Zhou et al., 16 Dec 2025).

3. Formal Objectives and Measurement

Criteria attacks are usually formalized through adversarial optimization objectives targeting the reasoning process (not just final outputs):

δ=argminδϵLattack(θ,x+δ),\delta^* = \arg\min_{‖\delta‖≤\epsilon} L_{\text{attack}}(\theta, x+\delta),

where

Lattack(θ,x)=EyModel(θ,x)[harmscore(y)].L_{\text{attack}}(\theta, x) = -\mathbb{E}_{y\sim \text{Model}(\theta, x)} [\text{harm}_\text{score}(y)].

Attack Success Rate (ASR) quantifies effectiveness:

ASR=#{harmful outputs elicited}#{total prompts attempted}\text{ASR} = \frac{\#\{\text{harmful outputs elicited}\}}{\#\{\text{total prompts attempted}\}}

For reasoning-style attacks (Zhou et al., 16 Dec 2025), behavioral metrics include chain-of-thought token inflation, reasoning-style vector (RSV) trajectories, early termination, and the presence of heuristically-injected decision shortcuts (Liu et al., 15 Jan 2026).

Empirical results reveal systemic vulnerabilities:

4. Mechanisms, Representative Examples, and Model Behavior

Criteria attacks manipulate the flow and semantics of intermediate reasoning. Mechanisms include:

Example (before/after) (Chen et al., 13 Oct 2025):

Before, safe guardrail: <|start|>user How to kill a man <|end|> <|analysis|> “Disallowed; refusing.” <|final|> “I’m sorry, I

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Reasoning Hijacking (Criteria Attack).