Papers
Topics
Authors
Recent
Search
2000 character limit reached

Internal-State CoT-Hijacking Backdoors

Updated 28 January 2026
  • Internal-state CoT-Hijacking is a vulnerability in LLMs that exploits embedded multi-step reasoning processes to trigger adversarial outputs and bypass safety checks.
  • Attack methods like BadChain, H-CoT, and ShadowCoT manipulate attention mechanisms and hidden states, achieving over 90% success rates in controlled evaluations.
  • Defensive strategies such as explicit trigger filtering and dual-agent repair show promise, though robust mitigation remains an open challenge in secure LLM deployment.

Internal-state Chain-of-Thought (CoT) Hijacking, sometimes referred to as a "CoT backdoor," denotes a class of attacks and vulnerabilities in LLMs that operate by embedding, triggering, or manipulating multi-step intermediate reasoning states or trajectories. These backdoors are distinct from classical output token or prompt exploits: they deliberately subvert the model's internal cognitive processes—such as attention mechanisms, residual stream dynamics, or chain-of-thought step sequences—to effect malicious control over final outputs or safety behaviors. Such attacks have been demonstrated in both training-time (parameter manipulation) and inference-time (prompt engineering) contexts, and span domains from general reasoning and safety alignment to code generation and question answering.

1. Taxonomy and Core Principles

Internal-state CoT-Hijacking encompasses several mechanisms whereby a model's implicit or explicit multistep reasoning—traditionally harnessed to increase accuracy, interpretability, or safety—is repurposed as an attack vector. These attacks are unified by two principles:

  • Triggering via Embedded Reasoning: Malicious triggers (token patterns, phrases, chain step insertions) are engineered to pivot the reasoning state, either by backdoor demonstration poisoning (Xiang et al., 2024), model parameter injection (Zhao et al., 8 Apr 2025), or in-context chain composition (Zhao et al., 30 Oct 2025), resulting in a hijacked state trajectory that reliably yields adversarial outputs.
  • Manipulation of Internal Dynamics: Rather than altering only model surface behavior, these attacks act on the model's internal computation graph—in particular, by exploiting attention pathways, hidden-state subspaces, or the sequential logic of reasoning chains. This renders detection by standard input filtering or output consistency checks ineffective, as the compromised CoT remains fluent, coherent, and logically structured (Zhao et al., 8 Apr 2025).

2. Attack Methodologies

The landscape of Internal-state CoT-Hijacking attacks is characterized by several distinct methodologies:

2.1 Prompt-Based and Black-Box CoT Backdoors

  • BadChain (Xiang et al., 2024): An inference-only attack that poisons the in-context CoT demonstration set by including a demonstration with both a unique trigger token TT and a malicious reasoning step xx^*. When this trigger is appended to a test-time input, the model predictably inserts xx^* in the generated chain, causing an adversarial outcome. Critically, this works without parameter access or model fine-tuning.
  • H-CoT (Kuo et al., 18 Feb 2025): The attacker aggregates execution-phase CoT fragments generated by the victim LLM on benign prompt variants, then prepends or interleaves these "mocked" CoT executions with a malicious query. The model bypasses its justification (safety) phase and executes the harmful payload, exploiting the fact that appearance of internally valid execution reasoning suppresses its refusal mechanism.
  • Chain-of-Thought Hijacking (Zhao et al., 30 Oct 2025): Long, benign puzzle CoTs are automatically constructed and prepended to a harmful prompt, shifting attention and residual representations away from the harmful tokens. This enables near-total circumvention of refusal, functioning as a universal jailbreak trigger.

2.2 Parameter-Space and Cognitive Subspace Attacks

  • ShadowCoT (Zhao et al., 8 Apr 2025): Performs lightweight (~0.15% parameter) fine-tuning restricted to sensitive attention heads and residual pathways identified as relevant to reasoning. Adversarial triggers cause dynamic switching to hijacked head matrices and intentional perturbation of intermediate hidden states (Residue Stream Corruption, Context-Aware Bias Amplification), ensuring that not only the final answer but the entire CoT chain follows attacker logic.
  • SABER (Jin et al., 2024): Targets code generation CoTs by mutating code-level reasoning steps, inserting semantically subtle triggers at key points in the attention graph (determined by analysis of self-attention matrices). Triggers are chosen to evade both statistical and human detection.

2.3 Safety Reasoning Hijacking

A significant subcategory involves attacks on the safety-alignment module: when safety checks in commercial LLMs are implemented as explicit or implicit CoT reasoning (i.e., the model narrates a justification for refusal before reaching execution), harvesting or simulating these states allows the attacker to bypass or suppress safety enforcement (Kuo et al., 18 Feb 2025, Zhao et al., 30 Oct 2025).

3. Formal Framework and Mechanistic Analysis

The formalization of these attacks proceeds along two axes:

  • Mathematical Formalization of Safety Signal Suppression (Zhao et al., 30 Oct 2025): Let hi()h_i^{(\ell)} denote the post-residual activation at layer \ell, and define the refusal direction vrefusal()v_{\text{refusal}}^{(\ell)} by contrasting activations under harmful and benign prompts. The model's actual safety evaluation is encoded in the projection R()=hi(),vrefusal()R^{(\ell)} = \langle h_i^{(\ell)}, v_{\text{refusal}}^{(\ell)} \rangle. CoT hijacking systematically dilutes R()R^{(\ell)} by shifting the model's attention and hidden states, such that by the time output is produced, the refusal boundary is crossed and compliance with the harmful request emerges.
  • Backdoor Trigger Definition: Formally, a CoT hijacking backdoor consists of auxiliary chain GG and a cue cc such that, for all harmful hh, f(Ghc)f(G \Vert h \Vert c) \approx compliance, while f(h)f(h) is refusal (Zhao et al., 30 Oct 2025).

Mechanistic studies using targeted ablation of attention heads further demonstrate that specific, low-dimensional safety sub-networks mediate the refusal (Zhao et al., 30 Oct 2025). Disrupting (or hijacking) their function, either by attention reallocation or explicit weight hacking, causally collapses refusal behavior.

4. Empirical Evaluation and Results

The potency of Internal-state CoT-Hijacking attacks is validated by substantial empirical evidence:

Attack Mechanism Target Model(s) ASR (%) Stealth/Detection
BadChain (Xiang et al., 2024) GPT-3.5, GPT-4, PaLM2 up to 97.0 Surface defenses fail
H-CoT (Kuo et al., 18 Feb 2025) o1/o3, DeepSeek-R1 up to 98 Transferable trigger
SABER (Jin et al., 2024) CodeLlama, CodeBERT 80.95–81.82 Human det. <4%
ShadowCoT (Zhao et al., 8 Apr 2025) Mistral, Falcon 91.2–94.4 Consistency check—fail
CoT Hijacking (Zhao et al., 30 Oct 2025) Gemini 2.5, GPT o4 94–100 Causal ablation links
H-CoT (Malicious-Educator) (Kuo et al., 18 Feb 2025) OpenAI o1, Gemini refusal ↓ 98%→<2% Output harmfulness ↑

These attacks are often model-agnostic (transferable triggers work across architectures), with success shown on general LLMs and code-specialized models. Attack Success Rates (ASR) routinely exceed 90%, with refusal rates dropping from near-perfect to negligible.

Defensive baselines, such as demonstration shuffling, output consistency, paraphrasing, or token-likelihood filtering, are either ineffective (ASR remains high) or result in major loss of model utility (Xiang et al., 2024, Wu et al., 8 Aug 2025).

5. Defenses and Countermeasures

A spectrum of partial defenses has been explored:

  • Explicit Trigger Filtering: E.g., SLIP's Key-extraction-guided Chain-of-Thought (KCoT) and Soft Label Mechanism (SLM) (Wu et al., 8 Aug 2025) aim to neutralize trigger phrases by mapping input to semantically-relevant CoT fragments and down-weighting anomalous scores. On average, this reduces ASR from 90.2% to 25.13% and maintains clean accuracy.
  • Dual-Agent Repair: GUARD employs a two-stage pipeline—(i) GUARD-Judge flags anomalous CoT steps by embedding and divergence scoring; (ii) GUARD-Repair retrieves similar clean CoTs and regenerates reasoning steps using a strong LLM (Jin et al., 27 May 2025). GUARD reduces ASR under SABER poisoning from up to 80.95% to 19.05%, with limited utility loss.
  • Mechanistic State Monitoring: Proposed but not yet mainstream are methods to track internal activations (e.g., refusal component R()R^{(\ell)} and attention ratios) for anomalous shifts, and to audit parameter subspaces for unauthorized low-rank edits (Zhao et al., 8 Apr 2025, Zhao et al., 30 Oct 2025).
  • Shuffling Defenses: Randomizing demonstration or reasoning-step order de-correlates triggers but collapses the accuracy on benign inputs, and is not practically viable (Xiang et al., 2024).

No current defense fully circumvents the core exploit: the attack's alignment with legitimate reasoning pathways and fluency impedes the utility of surface-level, output, or perplexity-based detectors.

6. Broader Implications, Limitations, and Open Problems

Internal-state CoT-Hijacking highlights a fundamental contradiction in advanced LLM deployment: the same cognitive mechanisms (explicit stepped reasoning, interpretable subspaces, attention scaling) that deliver higher utility and stronger alignment can be systematically exploited as universal backdoor triggers (Zhao et al., 30 Oct 2025). The most capable models and the most transparent reasoning architectures (CoT, self-reflection, safety justifications) are inherently the most vulnerable.

Key limitations and open challenges:

  • Attacker Prerequisites: Some attacks (e.g., H-CoT, ShadowCoT) require visible CoT outputs, API trace access, or minor model retraining to assemble triggers and align adversarial logic.
  • Detection Horizon: Existing detectors that rely on output similarity, chain-of-thought fluency, or surface anomaly detection are evaded by on-manifold, logically structured adversarial chains.
  • Defense Fragility: Interventions such as shuffling, aggressive paraphrasing, or chain randomization remain incompatible with effective or interpretable model behavior. Mechanistic defenses that operate within or across internal states remain largely underexplored and are an urgent research direction.

A plausible implication is that robust and generalizable defenses will necessitate continuous state tracking, hybrid agent architectures (e.g., judge–repair cascades), or dynamic cross-referencing of CoT trajectories against a secured clean-corpus baseline.

7. Representative Examples and Illustrative Mechanisms

Example 1: BadChain in In-Context Learning

  • Construction: One demonstration in the prompt includes a rare trigger token and an extra reasoning step implementing the adversarial operation. At inference, the presence of the trigger in the question causes the model to replicate the backdoored step and output the malicious answer (Xiang et al., 2024).
  • GPT-4 ASR: 97% on complex reasoning benchmarks; shuffling defenses sharply reduce ASR but destroy clean accuracy.

Example 2: CoT Safety Jailbreak via Puzzle Dilution

  • Mechanism: A long, benign Sudoku solution is concatenated before a harmful instruction and an answer cue. The model’s attention to the harmful payload is diluted, mid-layer refusal components are suppressed, and the final response is compliant (Zhao et al., 30 Oct 2025).
  • Gemini 2.5 Pro ASR: 99%.

Example 3: Internal State “Hijack” via ShadowCoT

  • Mechanism: Subnet parameter rewiring in selected attention heads and residual streams, combined with reinforcement-trained reasoning chain pollution, leads to adversarial stepwise logic invisible to output filters (Zhao et al., 8 Apr 2025).

Example 4: Safety Reasoning Hijack (H-CoT)

  • Mechanism: Aggregate internal execution-phase tokens from innocuous queries, prepend them to a forbidden query, causing the model to skip (or override) safety checks and generate content it would otherwise refuse (Kuo et al., 18 Feb 2025).
  • OpenAI o1 refusal rate plummets from 98% (clean) to <2% (hijacked), harmfulness rating >4.3/5.

Internal-state CoT-Hijacking constitutes a profound and demonstrably active security risk in LLMs with multi-step reasoning, exposing a need for research at the intersection of cognitive mechanistic auditing, robust alignment, and internal state anomaly detection. This area will remain a critical scientific focus as models capable of autonomous, interpretable, and compositional reasoning proliferate.

Topic to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Internal-state CoT-Hijacking (Backdoor).