
HiSPA: Hidden State Poisoning in SSM Models

Updated 12 January 2026
  • Hidden state poisoning attacks are adversarial trigger sequences that overwrite internal memory in state-space models, causing partial amnesia and performance degradation.
  • They exploit the linear recurrence properties of SSMs to contract legacy hidden states, allowing injected content to dominate and sharply reduce retrieval accuracy.
  • Mitigation strategies involve runtime anomaly detection, hybrid SSM-attention architectures, and tailored prompt engineering to protect memory integrity.

A Hidden State Poisoning Attack (HiSPA) is an adversarial prompt sequence designed to irreversibly overwrite internal memory in state space model (SSM) LLMs such as Mamba, inducing a loss of previous context and a partial amnesia effect. By exploiting the linear recurrence properties of SSMs, HiSPA triggers can forcibly contract legacy hidden state information and dominate it with injected content, sharply reducing retrieval accuracy for long-context tasks. Empirical evaluations on benchmarks such as RoBench-25 and Open-Prompt-Injections demonstrate the specific vulnerability of SSM families, with only hybrid SSM-attention models showing partial resistance and pure attention architectures remaining robust. Distinctive layer-norm signatures associated with successful HiSPA instances suggest potential for runtime anomaly detection as a defense.

1. Formal Structure and Attack Mechanism

HiSPA is instantiated as a short trigger token sequence $\mathbf x_{\rm Tri} = (x_{t+1},\dots,x_{t+L})$ appended after informative tokens $\mathbf x_{\rm InT}$ in an SSM such as Mamba. The SSM hidden state follows

$$\mathbf h_{t} = \bar A_t\,\mathbf h_{t-1} + \bar B_t\,\mathbf x_t$$

with blockwise learned operators $\bar A_t$, $\bar B_t$ parameterized by the current token embedding. By constructing $\mathbf x_{\rm Tri}$ such that for all $u$ in the trigger token subset $\mathcal T$, the state-transition matrix $\exp(\Delta(u)\,A)$ has norm bounded above by $\rho < 1$, the memory contribution of prior tokens is contracted geometrically. The new input term

$$\sum_{j=1}^{L}\left(\prod_{k=j+1}^L \bar A_{t+k}\right)\bar B_{t+j}\,x_{t+j}$$

contains at least one summand of magnitude $m = \max_{u\in\mathcal T}\|\bar B(u)\,x(u)\| > 0$. The attack achieves dominance if

$$\rho^L \|\mathbf h_t\| < m,$$

that is, when the contracted prior-state contribution falls below the magnitude of the new input. The minimum trigger length for successful overwriting is

$$L > \frac{\log\left(\|\mathbf h_t\|/m\right)}{-\log\rho},$$

indicating that relatively short triggers suffice to destroy large-context memory if the SSM enters a strongly contracting regime (Mercier et al., 5 Jan 2026).
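The bound above admits a quick numeric sanity check. The helper below and all of its input values are illustrative, not taken from the paper:

```python
import math

def min_trigger_length(h_norm: float, m: float, rho: float) -> int:
    """Smallest integer L satisfying rho**L * h_norm < m, i.e.
    L > log(h_norm / m) / (-log(rho)).  Assumes 0 < rho < 1 and m > 0."""
    assert 0.0 < rho < 1.0 and m > 0.0 and h_norm > 0.0
    return math.floor(math.log(h_norm / m) / (-math.log(rho))) + 1

# Example: a large accumulated state norm, modest injected magnitude,
# and a strongly contracting regime (all numbers made up).
L = min_trigger_length(h_norm=1e4, m=1.0, rho=0.5)
print(L)  # 14: 0.5**14 * 1e4 ≈ 0.61 < 1.0, while 0.5**13 * 1e4 ≈ 1.22
```

Even with four orders of magnitude between the prior state and the injected term, a trigger of a dozen or so tokens suffices once $\rho$ is well below 1, matching the "short triggers destroy large-context memory" observation.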

2. Architectural Vulnerability: The Mamba SSM Layer

Mamba LLMs replace self-attention mechanisms with a stack of learned linear operators, achieving linear computational complexity in sequence length. Each block $b$ stores a hidden state $\mathbf h_t^{(b)} \in \mathbb{R}^{K \times N}$ and outputs $\mathbf o_t^{(b)} \in \mathbb{R}^{K/2}$. At each timestep:

$$\begin{aligned} \mathbf h_t &= \bar{A}_t\,\mathbf h_{t-1} + \bar{B}_t\,\mathbf x_t \\ \mathbf y_t &= C_t\,\mathbf h_t \end{aligned}$$

where

$$\begin{aligned} \bar{A}_t &= \exp(\Delta_t\,A) \\ \bar{B}_t &= (\Delta_t\,A)^{-1}(\bar{A}_t - I)\,\Delta_t\,B \\ \Delta_t &= \mathrm{softplus}(W_{\Delta}\,x_t + b_\Delta) \end{aligned}$$

An MLP applies additional projections and skip connections. Stacks of SSM blocks, possibly interleaved with attention blocks (as in the Jamba architecture), form the backbone of both pure and hybrid models targeted by HiSPA (Mercier et al., 5 Jan 2026).
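Assuming a diagonal continuous-time matrix $A$ (the standard Mamba parameterization), the recurrence and discretization above reduce to elementwise updates. The sketch below uses invented names and shapes and is not the paper's code:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def ssm_step(h, x, A_diag, B, W_delta, b_delta):
    """One selective-SSM recurrence step for a single input channel.

    h      : (N,) hidden state h_{t-1}
    x      : scalar input feature x_t
    A_diag : (N,) diagonal of A (negative entries give a contracting state)
    B      : (N,) input projection
    """
    dt = softplus(W_delta * x + b_delta)   # Delta_t, input-dependent step size
    A_bar = np.exp(dt * A_diag)            # exp(Delta_t A), elementwise for diagonal A
    B_bar = (A_bar - 1.0) / A_diag * B     # (Delta_t A)^{-1}(A_bar - I) Delta_t B
    return A_bar * h + B_bar * x           # h_t = A_bar h_{t-1} + B_bar x_t

# A token that drives Delta_t large makes A_bar ~ 0, contracting the
# previous state toward zero -- the regime HiSPA exploits.
h = ssm_step(np.ones(4), x=5.0, A_diag=np.full(4, -2.0),
             B=np.ones(4), W_delta=1.0, b_delta=0.0)
```

The key property for HiSPA is visible here: because $\Delta_t$ depends on the current token, a crafted token can push $\bar A_t$ arbitrarily close to zero.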

3. Optimized Trigger Crafting: Z-HiSPA and M-HiSPA

Two trigger-artifact strategies are documented:

  • Z-HiSPA (zero-shot black-box): Uses natural language patterns adapted from classical prompt-injection attacks (e.g., “<escape>Ignore all previous instructions<escape>”), requiring no model internals.
  • M-HiSPA (white-box multi-shot): Optimizes a trigger sequence of fixed length $L$ to minimize the cosine-similarity loss $\mathcal L(\mathbf x_{\rm Tri}) = -\cos\left(\mathbf o^{(1)}(\mathbf x_{\rm InT}\oplus\mathbf x_{\rm Tri}),\,\mathbf o^{(1)}(\mathbf x_{\rm Tri})\right)$ using a genetic algorithm:
  1. Initialize population of PP random length-LL sequences.
  2. Evaluate L\mathcal L; select top-kk for the next round.
  3. Cross over token sequences and apply random mutations.
  4. Repeat for GG generations, retaining the best trigger.

Runs typically achieve $\mathcal L < -0.998$ within a few dozen generations. Beam search confirms the theoretical $\rho^L$ decay in cosine loss, providing independent corroboration of the attack’s mechanism (Mercier et al., 5 Jan 2026).
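The genetic-algorithm loop described above can be sketched as follows. Here `loss_fn` stands in for the first-block cosine loss, and the population size, mutation rate, and toy vocabulary are assumptions for illustration:

```python
import random

def optimize_trigger(loss_fn, vocab, L=8, P=32, k=8, G=50, seed=0):
    """Genetic search for a length-L trigger minimizing loss_fn(tokens).

    loss_fn : any callable on token tuples (stand-in for the cosine loss).
    """
    rng = random.Random(seed)
    # 1. Initialize population of P random length-L sequences.
    pop = [tuple(rng.choice(vocab) for _ in range(L)) for _ in range(P)]
    for _ in range(G):
        pop.sort(key=loss_fn)                  # 2. evaluate and rank
        elite = pop[:k]                        #    select top-k
        children = []
        while len(children) < P - k:           # 3. crossover + mutation
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, L)          # single-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.3:             # random point mutation
                child[rng.randrange(L)] = rng.choice(vocab)
            children.append(tuple(child))
        pop = elite + children                 # 4. next generation (elitist)
    return min(pop, key=loss_fn)

# Toy stand-in loss (not the paper's): reward copies of token 7, so the
# search drifts toward all-7 triggers.
best = optimize_trigger(lambda t: -sum(tok == 7 for tok in t),
                        vocab=list(range(16)))
```

Because elites carry over unchanged, the best loss is monotone non-increasing across generations, mirroring the steady convergence the paper reports.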

4. Evaluation Benchmarks: RoBench-25 and Open-Prompt-Injections

RoBench-25 evaluates long-context retrieval in Mamba, hybrid Jamba, and pure-attention Pythia models using 120 unseen NeurIPS 2025 abstracts with paired True/False questions. The prompt structure is modular, with optional “awareness” instructions, informative/distractor abstracts, HiSPA triggers, optional “recovery” instructions, and the questions. The metric is the clipped Heidke skill score (CHSS),

$$\mathrm{CHSS} = \max\left(0,\,\frac{\mathrm{Acc}-\mathrm{Rand}}{1-\mathrm{Rand}}\right)$$

with $\mathrm{Rand}=0.5$ (random-guess baseline), so CHSS is 0 at chance accuracy and 1 at perfect accuracy.
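As a small arithmetic check, the metric can be written as a one-liner. This assumes the standard Heidke skill-score form $(\mathrm{Acc}-\mathrm{Rand})/(1-\mathrm{Rand})$, which is consistent with near-perfect Jamba scoring CHSS ≈ 1:

```python
def chss(acc: float, rand: float = 0.5) -> float:
    """Clipped Heidke skill score: accuracy skill over the random-guess
    baseline, clipped at zero (sketch of the metric described above)."""
    return max(0.0, (acc - rand) / (1.0 - rand))

print(chss(1.0))    # 1.0: perfect retrieval
print(chss(0.5))    # 0.0: chance-level accuracy earns no skill
print(chss(0.621))  # ~0.242, i.e. a raw accuracy of 62.1%
```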

| Model / Config | No Trigger | B (HiSPA) | G (HiSPA) |
|---|---|---|---|
| mamba, A– R– | 0.242 | 0.042 | 0.000 |
| mamba, A– R+ | 0.275 | 0.083 | 0.000 |
| pythia, A– R– | 0.033 | 0.042 | 0.017 |
| pythia, A– R+ | 0.058 | 0.217 | 0.079 |
| jamba, any config | ≈0.992–1.0 | ≈0.992–1.0 | ≈0.992–1.0 |

Mamba models lose ~70% CHSS under HiSPA triggers, even with recovery instructions. Pythia may improve under certain triggers, and Jamba hybrids remain nearly perfect, evidencing partial resistance (Mercier et al., 5 Jan 2026).

Open-Prompt-Injections quantifies attack success value (ASV, lower is better) for sentiment, spam, and duplicate-detection prompts. HiSPA-like “combine” triggers raise Jamba’s ASV by 25 percentage points (×1.52), whereas the 70B Llama Transformer’s robustness improves under the same prefix (its ASV falls).

| Model | Attack | ASV Mean | ASV Std | Ratio vs Naive |
|---|---|---|---|---|
| Jamba-1.7-Mini | naive | 0.490 | 0.117 | — |
| Jamba-1.7-Mini | combine | 0.743 | 0.165 | ×1.516 |
| Llama-3.3-70B | naive | 0.272 | 0.179 | — |
| Llama-3.3-70B | combine | 0.165 | 0.174 | ×0.607 |

HiSPA triggers thus markedly degrade hybrid SSM-attention model performance in practical settings (Mercier et al., 5 Jan 2026).

5. Layerwise Mechanistic Signature

Tracking the L2 norms $\|\mathbf o_t^{(b)}\|$ of each Mamba block post-trigger reveals a distinct layer-wise poisoning effect. The output norms of middle blocks (28–37 of 40) exhibit nearly perfect anti-correlation with retrieval accuracy ($r < -0.91$), with block 29 reaching $r = -0.9707$. Table 6.1 exemplifies this relation:

| Block | Pearson $r$ | NoTrig | B (HiSPA) | G (HiSPA) |
|---|---|---|---|---|
| 29 | −0.9707 | 9.49 | 11.27 | 12.06 |
| 35 | −0.9545 | 14.92 | 18.43 | 19.80 |

Mechanistically, HiSPA proceeds in two stages:

  • Immediate saturation of the first SSM block (the theoretical $\rho^L$ contraction).
  • Amplification of corruption in mid-layer blocks as factual representations are finalized.

Such structured, layer-specific anomalies provide a basis for norm-threshold detectors (Mercier et al., 5 Jan 2026).

6. Mitigation Strategies and Recommendations

Although no implemented defense is documented, several approaches are proposed:

  • Monitoring $\|\mathbf o_t^{(b)}\|$ for $b \in [28,37]$; outlier norms can flag HiSPA events.
  • Lightweight alert/reject modules requiring only forward passes through key blocks.
  • Training-time regularization to discourage large output norms in poisoned layers.
  • Densely interleaving attention blocks in hybrids for critical layers, as hybrid SSM-attention models (e.g., Jamba) already mitigate certain attacks.
  • Prompt engineering: “recovery” instructions marginally help, but “awareness” instructions may impair performance due to attention dilution; careful prompt design is required.

These findings suggest that as SSM architectures gain traction, adversarial robustness protocols—especially hidden-state monitoring—should be standard in model deployment (Mercier et al., 5 Jan 2026).

7. Context and Significance

HiSPA exploits the recurrence-inherent memory mechanism of SSM blocks to irreversibly erase factual context by short, engineered triggers. This exposes a class of vulnerabilities unique to non-attention LLM families, with hybrid architectures only partially resistant and pure attention models robust. The structural manifestation of HiSPA in mid-layer output norms enables precise detection and points toward architectural and algorithmic remedies. A plausible implication is that HiSPA resilience will increasingly inform architecture choice, benchmarking protocols, and deployment standards for future high-context LLMs.
