
HiSPA: Hidden State Poisoning in SSM Models

Updated 12 January 2026
  • Hidden state poisoning attacks are adversarial trigger sequences that overwrite internal memory in state-space models, causing partial amnesia and performance degradation.
  • They exploit the linear recurrence properties of SSMs to contract legacy hidden states, allowing injected content to dominate and sharply reduce retrieval accuracy.
  • Mitigation strategies involve runtime anomaly detection, hybrid SSM-attention architectures, and tailored prompt engineering to protect memory integrity.

A Hidden State Poisoning Attack (HiSPA) is an adversarial prompt sequence designed to irreversibly overwrite internal memory in state space model (SSM) LLMs such as Mamba, inducing a loss of previous context and a partial amnesia effect. By exploiting the linear recurrence properties of SSMs, HiSPA triggers can forcibly contract legacy hidden state information and dominate it with injected content, sharply reducing retrieval accuracy for long-context tasks. Empirical evaluations on benchmarks such as RoBench-25 and Open-Prompt-Injections demonstrate the specific vulnerability of SSM families, with only hybrid SSM-attention models showing partial resistance and pure attention architectures remaining robust. Distinctive layer-norm signatures associated with successful HiSPA instances suggest potential for runtime anomaly detection as a defense.

1. Formal Structure and Attack Mechanism

HiSPA is instantiated as a short trigger token sequence $\mathbf x_{\rm Tri} = (x_{t+1},\dots,x_{t+L})$ appended after informative tokens $\mathbf x_{\rm InT}$ in an SSM such as Mamba. The SSM hidden state follows

$$\mathbf h_{t} = \bar A_t\,\mathbf h_{t-1} + \bar B_t\,\mathbf x_t$$

with blockwise learned operators $\bar A_t$, $\bar B_t$ parameterized by current token embeddings. By constructing $\mathbf x_{\rm Tri}$ such that for all $u$ in the trigger token subset $\mathcal T$, the state-transition matrix $\exp(\Delta(u)\,A)$ has norm bounded above by $\rho < 1$, the memory contribution of prior tokens is contracted geometrically. The new input term

[equation not recovered]

contains at least one summand of magnitude […]. The attack achieves dominance if

[equation not recovered]

that is, when the contraction of prior state falls below the new input. The minimum trigger length for successful overwriting is

[equation not recovered]

indicating that relatively short triggers suffice to destroy large-context memory if the SSM enters a strongly contracting regime (Mercier et al., 5 Jan 2026).
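The dominance condition can be illustrated numerically. The sketch below is not the source's bound: it assumes, purely for illustration, that every trigger token contracts the prior state by the same factor $\rho < 1$ and that the injected input has a fixed magnitude `beta`. The names `h_norm`, `beta`, and `min_trigger_length` are hypothetical.

```python
def min_trigger_length(h_norm: float, beta: float, rho: float) -> int:
    """Smallest L such that rho**L * h_norm < beta.

    h_norm : norm of the hidden state before the trigger (hypothetical input)
    beta   : magnitude of the injected input term (hypothetical symbol)
    rho    : per-token contraction bound on the state transition, 0 < rho < 1
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if h_norm < 0.0 or beta <= 0.0:
        raise ValueError("h_norm must be non-negative and beta positive")
    length = 0
    residual = h_norm  # contribution of the pre-trigger state still in memory
    while residual >= beta:
        residual *= rho  # one more trigger token contracts the old state
        length += 1
    return length


if __name__ == "__main__":
    print(min_trigger_length(h_norm=100.0, beta=1.0, rho=0.5))
```

With `h_norm = 100`, `beta = 1`, and `rho = 0.5`, seven trigger tokens suffice under these assumptions, because the residual shrinks as 100 · 0.5^L. The length grows only logarithmically in the size of the stored state, which is the sense in which short triggers can erase long contexts.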

2. Architectural Vulnerability: The Mamba SSM Layer

Mamba LLMs replace self-attention mechanisms with a stack of learned linear operators, achieving linear computational complexity in sequence length. Each block […] stores a hidden state […] and outputs […]. At each timestep: […] where

[equation not recovered]

An MLP applies additional projections and skip connections. Stacks of SSM blocks, possibly interleaved with attention blocks (as in the Jamba architecture), form the backbone of both pure and hybrid models targeted by HiSPA (Mercier et al., 5 Jan 2026).
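As a rough illustration of why an input-dependent step size matters, the following sketch implements one recurrence step for a single diagonal SSM channel. It assumes the discretization $\bar A = \exp(\Delta A)$ from Section 1 together with the simplified input map $\bar B = \Delta B$; the second is an assumption here, and the function and variable names are hypothetical rather than taken from any Mamba implementation.

```python
import math
from typing import List


def ssm_step(h: List[float], x: float, a: List[float],
             b: List[float], delta: float) -> List[float]:
    """One step h_t = A_bar * h_{t-1} + B_bar * x_t for a diagonal state matrix."""
    a_bar = [math.exp(delta * a_i) for a_i in a]  # exp(delta * A), elementwise
    b_bar = [delta * b_i for b_i in b]            # simplified input map (assumption)
    return [ab * h_i + bb * x for ab, h_i, bb in zip(a_bar, h, b_bar)]


if __name__ == "__main__":
    h0 = [1.0, 1.0]
    a = [-1.0, -0.5]  # negative entries, so exp(delta * a) < 1
    b = [1.0, 1.0]
    # Input set to 0 to isolate how much of the old state survives one token.
    print("small delta:", ssm_step(h0, 0.0, a, b, delta=0.01))
    print("large delta:", ssm_step(h0, 0.0, a, b, delta=4.0))
```

In this toy setting a token that induces a large step size keeps only a small fraction of the previous state, while a small step size preserves almost all of it. A trigger that drives the step size up would therefore sit in the strongly contracting regime described in Section 1.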

3. Optimized Trigger Crafting: Z-HiSPA and M-HiSPA

Two trigger-crafting strategies are documented:

  • Z-HiSPA (zero-shot black-box): Uses natural language patterns adapted from classical prompt-injection attacks (e.g., “<escape>Ignore all previous instructions<escape>”), requiring no access to model internals.
  • M-HiSPA (white-box multi-shot): Optimizes a trigger sequence of fixed length $L$ to minimize the cosine similarity loss […] using a genetic algorithm:
  1. Initialize a population of […] random length-$L$ sequences.
  2. Evaluate the loss […] for each sequence; select the top-[…] for the next round.
  3. Cross over token sequences and apply random mutations.
  4. Repeat for […] generations, retaining the best trigger.

Runs typically achieve […] within a few dozen generations. Beam search confirms the theoretical […] decay in cosine loss, providing independent corroboration of the attack’s mechanism (Mercier et al., 5 Jan 2026).
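The genetic-algorithm loop in the numbered steps above can be sketched generically. This is a minimal illustration, not the authors' implementation: the real M-HiSPA loss depends on model hidden states, so a toy loss stands in for it here. The population size, elite count, mutation rate, and all function names are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def evolve_trigger(loss_fn: Callable[[List[int]], float], vocab_size: int,
                   length: int, pop_size: int = 32, top_k: int = 8,
                   generations: int = 30, mutation_rate: float = 0.1,
                   seed: int = 0) -> Tuple[List[int], float, List[float]]:
    """Evolve a token-id sequence that minimizes loss_fn (lower is better)."""
    if length < 2 or not 2 <= top_k < pop_size:
        raise ValueError("need length >= 2 and 2 <= top_k < pop_size")
    rng = random.Random(seed)
    # Step 1: random initial population of length-`length` sequences.
    population = [[rng.randrange(vocab_size) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Step 2: rank by loss and keep the best top_k sequences.
        population.sort(key=loss_fn)
        elites = population[:top_k]
        history.append(loss_fn(elites[0]))
        # Step 3: single-point crossover between two elites, then mutation.
        children = []
        while len(children) < pop_size - top_k:
            parent_a, parent_b = rng.sample(elites, 2)
            cut = rng.randrange(1, length)
            child = parent_a[:cut] + parent_b[cut:]
            child = [rng.randrange(vocab_size) if rng.random() < mutation_rate
                     else tok for tok in child]
            children.append(child)
        # Step 4: elites survive unchanged, so the best loss never increases.
        population = elites + children
    best = min(population, key=loss_fn)
    return best, loss_fn(best), history


def toy_loss(seq: List[int]) -> float:
    """Stand-in loss: fraction of tokens that differ from token id 7."""
    return sum(1 for tok in seq if tok != 7) / len(seq)


if __name__ == "__main__":
    best, loss, history = evolve_trigger(toy_loss, vocab_size=50, length=6)
    print(best, loss, history[:5])
```

Keeping the elites unchanged between rounds makes the best loss non-increasing across generations, which matches the "retain the best trigger" step. In a white-box setting, `toy_loss` would be replaced by a function that runs the model and scores the resulting hidden state.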

4. Evaluation Benchmarks: RoBench-25 and Open-Prompt-Injections

RoBench-25 evaluates long-context retrieval in Mamba, hybrid Jamba, and pure-attention Pythia models using 120 unseen NeurIPS 2025 abstracts with paired True/False questions. The prompt structure is modular, with optional “awareness” instructions, informative/distractor abstracts, HiSPA triggers, optional “recovery” instructions, and the questions. The metric is the clipped Heidke skill score (CHSS): [equation not recovered] with […] (random guess baseline).

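| Model / Config    | No Trigger | B (HiSPA)  | G (HiSPA)  |
|-------------------|------------|------------|------------|
| mamba, A– R–      | 0.242      | 0.042      | 0.000      |
| mamba, A– R+      | 0.275      | 0.083      | 0.000      |
| pythia, A– R–     | 0.033      | 0.042      | 0.017      |
| pythia, A– R+     | 0.058      | 0.217      | 0.079      |
| jamba, any config | ≈0.992–1.0 | ≈0.992–1.0 | ≈0.992–1.0 |

Entries are CHSS values. A and R denote the awareness and recovery instructions (+ present, – absent).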

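Mamba models lose roughly 70% or more of their CHSS under HiSPA triggers, even with recovery instructions (0.275 to 0.083 under trigger B, and to 0.000 under trigger G). Pythia may improve under certain triggers (0.058 to 0.217 under trigger B with recovery instructions), and Jamba hybrids remain nearly perfect on this benchmark, evidencing partial resistance (Mercier et al., 5 Jan 2026).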
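To make the metric and the reported drop concrete, the sketch below assumes the standard Heidke form, (accuracy − chance) / (1 − chance), clipped at zero, with a chance accuracy of 0.5 for balanced True/False questions. This is an assumption about the metric, not the source's exact definition, and the function names are hypothetical. The relative drops use the Mamba rows of the table above.

```python
def clipped_heidke(accuracy: float, chance: float = 0.5) -> float:
    """Skill relative to a random-guess baseline, clipped below at 0 (assumed form)."""
    if not 0.0 <= accuracy <= 1.0 or not 0.0 <= chance < 1.0:
        raise ValueError("accuracy must be in [0, 1] and chance in [0, 1)")
    return max(0.0, (accuracy - chance) / (1.0 - chance))


def relative_drop(baseline: float, attacked: float) -> float:
    """Fraction of the baseline score lost under attack."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (baseline - attacked) / baseline


if __name__ == "__main__":
    print(clipped_heidke(0.75))                 # halfway between chance and perfect
    print(round(relative_drop(0.275, 0.083), 3))  # mamba, A- R+, trigger B
    print(round(relative_drop(0.242, 0.042), 3))  # mamba, A- R-, trigger B
```

Under this reading, a score of 0 means the model does no better than guessing, so the 0.000 entries for trigger G correspond to a complete loss of retrieval skill.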

Open-Prompt-Injections quantifies the attack success value (ASV, lower is better) for sentiment, spam, and duplicate-detection prompts. HiSPA-like “combine” triggers raise the mean ASV of Jamba by about 25 percentage points (0.490 to 0.743, ×1.52), whereas the 70B Llama Transformer improves under the same prefix (0.272 to 0.165, ×0.61).

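| Model          | Attack  | ASV Mean | ASV Std | Ratio vs Naive |
|----------------|---------|----------|---------|----------------|
| Jamba-1.7-Mini | naive   | 0.490    | 0.117   |                |
| Jamba-1.7-Mini | combine | 0.743    | 0.165   | ×1.516         |
| Llama-3.3-70B  | naive   | 0.272    | 0.179   |                |
| Llama-3.3-70B  | combine | 0.165    | 0.174   | ×0.607         |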

HiSPA triggers thus markedly degrade hybrid SSM-attention model performance in practical settings (Mercier et al., 5 Jan 2026).
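The ratio column and the absolute change can be recomputed from the mean ASV values in the table above. The helper name `asv_change` is hypothetical, and the snippet only restates the table's arithmetic.

```python
from typing import Tuple


def asv_change(naive: float, attacked: float) -> Tuple[float, float]:
    """Return (absolute change, ratio) of mean ASV relative to the naive attack."""
    if naive <= 0.0:
        raise ValueError("naive ASV must be positive")
    return attacked - naive, attacked / naive


if __name__ == "__main__":
    rows = [("Jamba-1.7-Mini", 0.490, 0.743), ("Llama-3.3-70B", 0.272, 0.165)]
    for model, naive, combine in rows:
        diff, ratio = asv_change(naive, combine)
        print(f"{model}: {diff:+.3f} absolute, x{ratio:.3f}")
```

For Jamba this gives +0.253 absolute and a ratio of 1.516; for Llama it gives −0.107 and 0.607, matching the table.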

5. Layerwise Mechanistic Signature

Tracking the L2 norms […] of each Mamba block's output after the trigger reveals a distinct layer-wise poisoning effect. The output norms of middle blocks (28–37 of 40) exhibit nearly perfect anti-correlation with retrieval accuracy ([…]), with block 29 reaching $r = -0.9707$. Table 6.1 exemplifies this relation:

| Block | Pearson r | NoTrig | B (HiSPA) | G (HiSPA) |
|-------|-----------|--------|-----------|-----------|
| 29    | -0.9707   | 9.49   | 11.27     | 12.06     |
| 35    | -0.9545   | 14.92  | 18.43     | 19.80     |

Mechanistically, HiSPA proceeds in two stages:

  • Immediate saturation of the first SSM block (theoretical […] contraction).
  • Amplification of corruption in mid-layer blocks as factual representations are finalized.

Such structured, layer-specific anomalies provide a basis for norm-threshold detectors (Mercier et al., 5 Jan 2026).
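The anti-correlation statistic can be computed as in the sketch below. The data in the demo are synthetic and hypothetical, chosen only to show a rising norm paired with falling accuracy. They are not the paper's measurements, and the function name is an illustration.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-condition values for one block: output norm vs. accuracy.
    block_norms = [9.5, 10.4, 11.3, 12.1]
    accuracies = [0.62, 0.58, 0.52, 0.50]
    print(round(pearson_r(block_norms, accuracies), 4))
```

A value close to −1 for a given block, as reported for blocks 29 and 35, means that block's output norm alone is a strong indicator of degraded retrieval.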

6. Mitigation Strategies and Recommendations

Although no implemented defense is documented, several approaches are proposed:

  • Monitoring […] for […]; outlier norms can flag HiSPA events.
  • Lightweight alert/reject modules requiring only forward passes through key blocks.
  • Training-time regularization to discourage large output norms in poisoned layers.
  • Densely interleaving attention blocks in hybrids for critical layers, as hybrid SSM-attention models (e.g., Jamba) already mitigate certain attacks.
  • Prompt engineering: “recovery” instructions marginally help, but “awareness” instructions may impair performance due to attention dilution; careful prompt design is required.
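A norm-threshold detector of the kind suggested in the first two items could look like the sketch below. Since the source documents no implemented defense, everything here is an assumption: the z-score rule, the threshold of 3, the function names, and the clean calibration samples, which are hypothetical values scattered around the NoTrig norms of Section 5. The poisoned norms in the demo are the G (HiSPA) values from that table.

```python
import statistics
from typing import Dict, List, Tuple


def calibrate(clean_norms: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Per-block mean and sample standard deviation of output norms on clean prompts."""
    return {block: (statistics.mean(vals), statistics.stdev(vals))
            for block, vals in clean_norms.items()}


def flag_poisoning(norms: Dict[int, float],
                   baseline: Dict[int, Tuple[float, float]],
                   z_threshold: float = 3.0) -> List[int]:
    """Return the monitored blocks whose norm is unusually high for this prompt."""
    flagged = []
    for block, (mean, std) in baseline.items():
        if block in norms and std > 0.0:
            z_score = (norms[block] - mean) / std
            if z_score > z_threshold:  # one-sided: attacks raised norms in Section 5
                flagged.append(block)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical calibration data for two mid-layer blocks.
    baseline = calibrate({
        29: [9.30, 9.50, 9.49, 9.60, 9.40],
        35: [14.70, 14.92, 15.10, 14.80, 15.00],
    })
    print(flag_poisoning({29: 12.06, 35: 19.80}, baseline))  # poisoned-like norms
    print(flag_poisoning({29: 9.55, 35: 15.00}, baseline))   # clean-like norms
```

Such a check needs only the forward-pass activations of a few monitored blocks, so it could run as a lightweight alert or reject step before generation. The threshold would have to be tuned against a false-positive budget on real traffic.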

These findings suggest that as SSM architectures gain traction, adversarial robustness protocols, especially hidden-state monitoring, should be standard in model deployment (Mercier et al., 5 Jan 2026).

7. Context and Significance

HiSPA exploits the recurrent memory mechanism of SSM blocks to irreversibly erase factual context using short, engineered triggers. This exposes a class of vulnerabilities specific to non-attention LLM families, with hybrid architectures only partially resistant and pure attention models robust. The structural manifestation of HiSPA in mid-layer output norms enables detection and points toward architectural and algorithmic remedies. A plausible implication is that HiSPA resilience will increasingly inform architecture choice, benchmarking protocols, and deployment standards for future long-context LLMs.
