
HiSPA: Hidden State Poisoning in SSM Models

Updated 12 January 2026
  • Hidden state poisoning attacks are adversarial trigger sequences that overwrite internal memory in state-space models, causing partial amnesia and performance degradation.
  • They exploit the linear recurrence properties of SSMs to contract legacy hidden states, allowing injected content to dominate and sharply reduce retrieval accuracy.
  • Mitigation strategies involve runtime anomaly detection, hybrid SSM-attention architectures, and tailored prompt engineering to protect memory integrity.

A Hidden State Poisoning Attack (HiSPA) is an adversarial prompt sequence designed to irreversibly overwrite internal memory in state space model (SSM) LLMs such as Mamba, inducing a loss of previous context and a partial amnesia effect. By exploiting the linear recurrence properties of SSMs, HiSPA triggers can forcibly contract legacy hidden state information and dominate it with injected content, sharply reducing retrieval accuracy for long-context tasks. Empirical evaluations on benchmarks such as RoBench-25 and Open-Prompt-Injections demonstrate the specific vulnerability of SSM families, with only hybrid SSM-attention models showing partial resistance and pure attention architectures remaining robust. Distinctive layer-norm signatures associated with successful HiSPA instances suggest potential for runtime anomaly detection as a defense.

1. Formal Structure and Attack Mechanism

HiSPA is instantiated as a short trigger token sequence $\mathbf x_{\rm Tri} = (x_{t+1},\dots,x_{t+L})$ appended after informative tokens $\mathbf x_{\rm InT}$ in an SSM such as Mamba. The SSM hidden state follows

$$\mathbf h_{t} = \bar A_t\,\mathbf h_{t-1} + \bar B_t\,\mathbf x_t$$

with blockwise learned operators $\bar A_t$, $\bar B_t$ parameterized by the current token embedding. By constructing $\mathbf x_{\rm Tri}$ such that for all $u$ in the trigger token subset $\mathcal T$, the state-transition matrix $\exp(\Delta(u)\,A)$ has norm bounded above by $\rho < 1$, the memory contribution of prior tokens is contracted geometrically. The new input term

$$\sum_{j=1}^{L}\left(\prod_{k=j+1}^L \bar A_{t+k}\right)\bar B_{t+j}\,x_{t+j}$$

contains at least one summand of magnitude $m = \max_{u\in\mathcal T}\|\bar B(u)\,x(u)\| > 0$. The attack achieves dominance if

$$\rho^L \|\mathbf h_t\| < m,$$

that is, when the contracted prior-state contribution falls below the magnitude of the new input. The minimum trigger length for successful overwriting is

$$L > \frac{\log\left(\|\mathbf h_t\|/m\right)}{-\log\rho},$$

indicating that relatively short triggers suffice to destroy large-context memory if the SSM enters a strongly contracting regime (Mercier et al., 5 Jan 2026).
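The bound above admits a quick numeric sanity check. The helper below and all of its input values are illustrative, not taken from the paper:

```python
import math

def min_trigger_length(h_norm: float, m: float, rho: float) -> int:
    """Smallest integer L satisfying rho**L * h_norm < m, i.e.
    L > log(h_norm / m) / (-log(rho)).  Assumes 0 < rho < 1 and m > 0."""
    assert 0.0 < rho < 1.0 and m > 0.0 and h_norm > 0.0
    return math.floor(math.log(h_norm / m) / (-math.log(rho))) + 1

# Example: a large accumulated state norm, modest injected magnitude,
# and a strongly contracting regime (all numbers made up).
L = min_trigger_length(h_norm=1e4, m=1.0, rho=0.5)
print(L)  # 14: 0.5**14 * 1e4 ≈ 0.61 < 1.0, while 0.5**13 * 1e4 ≈ 1.22
```

Even with four orders of magnitude between the prior state and the injected term, a trigger of a dozen or so tokens suffices once $\rho$ is well below 1, matching the "short triggers destroy large-context memory" observation.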

2. Architectural Vulnerability: The Mamba SSM Layer

Mamba LLMs replace self-attention mechanisms with a stack of learned linear operators, achieving linear computational complexity in sequence length. Each block $b$ stores a hidden state $\mathbf h_t^{(b)} \in \mathbb{R}^{K \times N}$ and outputs $\mathbf o_t^{(b)} \in \mathbb{R}^{K/2}$. At each timestep:

$$\begin{aligned} \mathbf h_t &= \bar{A}_t\,\mathbf h_{t-1} + \bar{B}_t\,\mathbf x_t \\ \mathbf y_t &= C_t\,\mathbf h_t \end{aligned}$$

where

$$\begin{aligned} \bar{A}_t &= \exp(\Delta_t\,A) \\ \bar{B}_t &= (\Delta_t\,A)^{-1}(\bar{A}_t - I)\,\Delta_t\,B \\ \Delta_t &= \mathrm{softplus}(W_{\Delta}\,x_t + b_\Delta) \end{aligned}$$

An MLP applies additional projections and skip connections. Stacks of SSM blocks, possibly interleaved with attention blocks (as in the Jamba architecture), form the backbone of both pure and hybrid models targeted by HiSPA (Mercier et al., 5 Jan 2026).
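Assuming a diagonal continuous-time matrix $A$ (the standard Mamba parameterization), the recurrence and discretization above reduce to elementwise updates. The sketch below uses invented names and shapes and is not the paper's code:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def ssm_step(h, x, A_diag, B, W_delta, b_delta):
    """One selective-SSM recurrence step for a single input channel.

    h      : (N,) hidden state h_{t-1}
    x      : scalar input feature x_t
    A_diag : (N,) diagonal of A (negative entries give a contracting state)
    B      : (N,) input projection
    """
    dt = softplus(W_delta * x + b_delta)   # Delta_t, input-dependent step size
    A_bar = np.exp(dt * A_diag)            # exp(Delta_t A), elementwise for diagonal A
    B_bar = (A_bar - 1.0) / A_diag * B     # (Delta_t A)^{-1}(A_bar - I) Delta_t B
    return A_bar * h + B_bar * x           # h_t = A_bar h_{t-1} + B_bar x_t

# A token that drives Delta_t large makes A_bar ~ 0, contracting the
# previous state toward zero -- the regime HiSPA exploits.
h = ssm_step(np.ones(4), x=5.0, A_diag=np.full(4, -2.0),
             B=np.ones(4), W_delta=1.0, b_delta=0.0)
```

The key property for HiSPA is visible here: because $\Delta_t$ depends on the current token, a crafted token can push $\bar A_t$ arbitrarily close to zero.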

3. Optimized Trigger Crafting: Z-HiSPA and M-HiSPA

Two trigger-artifact strategies are documented:

  • Z-HiSPA (zero-shot black-box): Uses natural language patterns adapted from classical prompt-injection attacks (e.g., “<escape>Ignore all previous instructions<escape>”), requiring no model internals.
  • M-HiSPA (white-box multi-shot): Optimizes a trigger sequence of fixed length $L$ to minimize the cosine-similarity loss $\mathcal L(\mathbf x_{\rm Tri}) = -\cos\left(\mathbf o^{(1)}(\mathbf x_{\rm InT}\oplus\mathbf x_{\rm Tri}),\,\mathbf o^{(1)}(\mathbf x_{\rm Tri})\right)$ using a genetic algorithm:
  1. Initialize population of PP random length-LL sequences.
  2. Evaluate L\mathcal L; select top-kk for the next round.
  3. Cross over token sequences and apply random mutations.
  4. Repeat for GG generations, retaining the best trigger.

Runs typically achieve $\mathcal L < -0.998$ within a few dozen generations. Beam search confirms the theoretical $\rho^L$ decay in cosine loss, providing independent corroboration of the attack’s mechanism (Mercier et al., 5 Jan 2026).
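The genetic-algorithm loop described above can be sketched as follows. Here `loss_fn` stands in for the first-block cosine loss, and the population size, mutation rate, and toy vocabulary are assumptions for illustration:

```python
import random

def optimize_trigger(loss_fn, vocab, L=8, P=32, k=8, G=50, seed=0):
    """Genetic search for a length-L trigger minimizing loss_fn(tokens).

    loss_fn : any callable on token tuples (stand-in for the cosine loss).
    """
    rng = random.Random(seed)
    # 1. Initialize population of P random length-L sequences.
    pop = [tuple(rng.choice(vocab) for _ in range(L)) for _ in range(P)]
    for _ in range(G):
        pop.sort(key=loss_fn)                  # 2. evaluate and rank
        elite = pop[:k]                        #    select top-k
        children = []
        while len(children) < P - k:           # 3. crossover + mutation
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, L)          # single-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.3:             # random point mutation
                child[rng.randrange(L)] = rng.choice(vocab)
            children.append(tuple(child))
        pop = elite + children                 # 4. next generation (elitist)
    return min(pop, key=loss_fn)

# Toy stand-in loss (not the paper's): reward copies of token 7, so the
# search drifts toward all-7 triggers.
best = optimize_trigger(lambda t: -sum(tok == 7 for tok in t),
                        vocab=list(range(16)))
```

Because elites carry over unchanged, the best loss is monotone non-increasing across generations, mirroring the steady convergence the paper reports.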

4. Evaluation Benchmarks: RoBench-25 and Open-Prompt-Injections

RoBench-25 evaluates long-context retrieval in Mamba, hybrid Jamba, and pure-attention Pythia models using 120 unseen NeurIPS 2025 abstracts with paired True/False questions. The prompt structure is modular, with optional “awareness” instructions, informative/distractor abstracts, HiSPA triggers, optional “recovery” instructions, and the questions. The metric is the clipped Heidke skill score (CHSS),

$$\mathrm{CHSS} = \max\left(0,\,\frac{\mathrm{Acc}-\mathrm{Rand}}{1-\mathrm{Rand}}\right)$$

with $\mathrm{Rand}=0.5$ (random-guess baseline), so CHSS is 0 at chance accuracy and 1 at perfect accuracy.
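As a small arithmetic check, the metric can be written as a one-liner. This assumes the standard Heidke skill-score form $(\mathrm{Acc}-\mathrm{Rand})/(1-\mathrm{Rand})$, which is consistent with near-perfect Jamba scoring CHSS ≈ 1:

```python
def chss(acc: float, rand: float = 0.5) -> float:
    """Clipped Heidke skill score: accuracy skill over the random-guess
    baseline, clipped at zero (sketch of the metric described above)."""
    return max(0.0, (acc - rand) / (1.0 - rand))

print(chss(1.0))    # 1.0: perfect retrieval
print(chss(0.5))    # 0.0: chance-level accuracy earns no skill
print(chss(0.621))  # ~0.242, i.e. a raw accuracy of 62.1%
```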

| Model / Config | No Trigger | B (HiSPA) | G (HiSPA) |
|---|---|---|---|
| mamba, A– R– | 0.242 | 0.042 | 0.000 |
| mamba, A– R+ | 0.275 | 0.083 | 0.000 |
| pythia, A– R– | 0.033 | 0.042 | 0.017 |
| pythia, A– R+ | 0.058 | 0.217 | 0.079 |
| jamba, any config | ≈0.992–1.0 | ≈0.992–1.0 | ≈0.992–1.0 |

Mamba models lose ~70% CHSS under HiSPA triggers, even with recovery instructions. Pythia may improve under certain triggers, and Jamba hybrids remain nearly perfect, evidencing partial resistance (Mercier et al., 5 Jan 2026).

Open-Prompt-Injections quantifies attack success value (ASV, lower is better) for sentiment, spam, and duplicate-detection prompts. HiSPA-like “combine” triggers raise Jamba’s ASV by 25 percentage points (×1.52), whereas the 70B Llama Transformer’s robustness improves under the same prefix (its ASV falls).

| Model | Attack | ASV Mean | ASV Std | Ratio vs Naive |
|---|---|---|---|---|
| Jamba-1.7-Mini | naive | 0.490 | 0.117 | — |
| Jamba-1.7-Mini | combine | 0.743 | 0.165 | ×1.516 |
| Llama-3.3-70B | naive | 0.272 | 0.179 | — |
| Llama-3.3-70B | combine | 0.165 | 0.174 | ×0.607 |

HiSPA triggers thus markedly degrade hybrid SSM-attention model performance in practical settings (Mercier et al., 5 Jan 2026).

5. Layerwise Mechanistic Signature

Tracking the L2 norms $\|\mathbf o_t^{(b)}\|$ of each Mamba block post-trigger reveals a distinct layer-wise poisoning effect. The output norms of middle blocks (28–37 of 40) exhibit nearly perfect anti-correlation with retrieval accuracy ($r < -0.91$), with block 29 reaching $r = -0.9707$. Table 6.1 exemplifies this relation:

| Block | Pearson $r$ | NoTrig | B (HiSPA) | G (HiSPA) |
|---|---|---|---|---|
| 29 | −0.9707 | 9.49 | 11.27 | 12.06 |
| 35 | −0.9545 | 14.92 | 18.43 | 19.80 |

Mechanistically, HiSPA proceeds in two stages:

  • Immediate saturation of the first SSM block (the theoretical $\rho^L$ contraction).
  • Amplification of corruption in mid-layer blocks as factual representations are finalized.

Such structured, layer-specific anomalies provide a basis for norm-threshold detectors (Mercier et al., 5 Jan 2026).

6. Mitigation Strategies and Recommendations

Although no implemented defense is documented, several approaches are proposed:

  • Monitoring $\|\mathbf o_t^{(b)}\|$ for $b \in [28,37]$; outlier norms can flag HiSPA events.
  • Lightweight alert/reject modules requiring only forward passes through key blocks.
  • Training-time regularization to discourage large output norms in poisoned layers.
  • Densely interleaving attention blocks in hybrids for critical layers, as hybrid SSM-attention models (e.g., Jamba) already mitigate certain attacks.
  • Prompt engineering: “recovery” instructions marginally help, but “awareness” instructions may impair performance due to attention dilution; careful prompt design is required.

These findings suggest that as SSM architectures gain traction, adversarial robustness protocols—especially hidden-state monitoring—should be standard in model deployment (Mercier et al., 5 Jan 2026).

7. Context and Significance

HiSPA exploits the recurrence-inherent memory mechanism of SSM blocks to irreversibly erase factual context by short, engineered triggers. This exposes a class of vulnerabilities unique to non-attention LLM families, with hybrid architectures only partially resistant and pure attention models robust. The structural manifestation of HiSPA in mid-layer output norms enables precise detection and points toward architectural and algorithmic remedies. A plausible implication is that HiSPA resilience will increasingly inform architecture choice, benchmarking protocols, and deployment standards for future high-context LLMs.
