Graph-Regularized Sparse Autoencoders (GSAEs)
- Graph-Regularized Sparse Autoencoders (GSAEs) are sparse autoencoders augmented with graph-based regularization, trained on deep-model activations to capture distributed, safety-aligned neural representations.
- They combine reconstruction loss, ℓ1 sparsity, and a Laplacian smoothness penalty over neuron co-activation graphs to enforce coherent latent feature extraction.
- Empirical results demonstrate that GSAEs boost selective refusal performance and robustness against adversarial attacks, significantly outperforming previous safety steering approaches.
Graph-Regularized Sparse Autoencoders (GSAEs) are a class of neural architectures designed to recover distributed, concept-aligned representations from deep models, notably for intervening on LLMs in safety-critical contexts. By introducing a Laplacian smoothness penalty over a neuron co-activation graph, GSAEs extend traditional sparse autoencoders (SAEs) to capture safety concepts as coherent patterns spanning multiple latent features, rather than isolating them within single dimensions. Empirical evidence demonstrates that GSAEs enable state-of-the-art selective refusal performance and robustness against adversarial prompt attacks, substantially improving upon prior safety steering methods (Yeon et al., 7 Dec 2025).
1. Model Architecture
GSAEs take as input pooled hidden states $x \in \mathbb{R}^d$ extracted from selected transformer layers. The encoder is a linear transformation followed by a ReLU activation, $z = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}})$, and the latent code $z \in \mathbb{R}^m$ is enforced to be sparse via regularization. The decoder reconstructs the input with a linear transformation, $\hat{x} = W_{\mathrm{dec}} z + b_{\mathrm{dec}}$. For a dataset of $n$ samples, the corresponding sets of codes $\{z_i\}$ and reconstructions $\{\hat{x}_i\}$ are maintained. This architecture enables distributed feature encoding while favoring sparse, interpretable activations.
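The encoder–decoder pair can be sketched in a few lines of NumPy (the dimensions, initialization, and variable names here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 16, 64                                # input (hidden-state) dim, latent dim (illustrative)
W_enc = rng.normal(scale=0.1, size=(m, d))   # encoder weights (hypothetical init)
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(d, m))   # decoder weights
b_dec = np.zeros(d)

def encode(x):
    # Linear map followed by ReLU yields a non-negative latent code z
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(z):
    # Linear reconstruction of the pooled hidden state
    return W_dec @ z + b_dec

x = rng.normal(size=d)   # stand-in for a pooled hidden state from a transformer layer
z = encode(x)
x_hat = decode(z)
```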
2. Objective Function and Graph Regularization
GSAEs optimize a composite loss of the form

$$\mathcal{L} = \sum_{i} \|x_i - \hat{x}_i\|_2^2 + \lambda_1 \sum_{i} \|z_i\|_1 + \lambda_2\, \mathrm{tr}\!\left(W_{\mathrm{dec}}^{\top} L\, W_{\mathrm{dec}}\right),$$

where:
- $\|x_i - \hat{x}_i\|_2^2$ is the reconstruction error.
- The sparsity penalty $\lambda_1 \|z_i\|_1$ encourages most latent activations to be zero.
- The graph Laplacian term uses $L = D - A$, where $A$ is the adjacency matrix of the neuron co-activation graph (detailed in Section 3) and $D$ is the diagonal degree matrix. This term enforces the decoded features (columns $w_k$ of $W_{\mathrm{dec}}$) to be smooth with respect to neuron co-activations, since $\mathrm{tr}(W_{\mathrm{dec}}^{\top} L\, W_{\mathrm{dec}}) = \tfrac{1}{2}\sum_{k}\sum_{i,j} A_{ij}\,(w_k[i] - w_k[j])^2$. The overall effect is to favor features that capture smooth, distributed structure over the co-activation manifold inferred from model activations (Yeon et al., 7 Dec 2025).
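A minimal sketch of the composite objective, assuming the standard form of each term (the penalty weights `lam1`/`lam2` are placeholders, not the paper's values):

```python
import numpy as np

def gsae_loss(X, X_hat, Z, W_dec, L, lam1=1e-3, lam2=1e-3):
    """Composite GSAE objective (sketch; lam1/lam2 are illustrative weights).

    X, X_hat : (n, d) inputs and reconstructions
    Z        : (n, m) latent codes
    W_dec    : (d, m) decoder; each column is a feature direction over d neurons
    L        : (d, d) unnormalized graph Laplacian, L = D - A
    """
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # reconstruction error
    sparsity = np.mean(np.sum(np.abs(Z), axis=1))       # l1 penalty on activations
    # tr(W^T L W) penalizes decoder columns that vary sharply across
    # strongly co-activating neurons (smoothness over the graph)
    smooth = np.trace(W_dec.T @ L @ W_dec)
    return recon + lam1 * sparsity + lam2 * smooth
```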
3. Construction of the Neuron Co-Activation Graph
The co-activation graph encodes functional similarity among neurons based on their activation profiles across inputs. The construction procedure:
- Collect pooled activations $h_1, \dots, h_n \in \mathbb{R}^d$ over a corpus of inputs; let $a_i \in \mathbb{R}^n$ denote neuron $i$'s activation profile across the corpus.
- For each neuron pair $(i, j)$, compute the cosine similarity $s_{ij} = \langle a_i, a_j \rangle / (\|a_i\|\,\|a_j\|)$.
- Adjacency entries are thresholded at a similarity cutoff $\tau$: $A_{ij} = s_{ij}$ if $s_{ij} \geq \tau$, and $A_{ij} = 0$ otherwise.
- The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$.
- The unnormalized Laplacian $L = D - A$ is then used in the regularization term.
This process grounds concept learning in the empirical distribution of neuron co-activations, enabling the Laplacian penalty to promote structured, interpretable decompositions.
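The construction steps above translate directly into NumPy (a sketch; the threshold value used here is illustrative):

```python
import numpy as np

def coactivation_laplacian(H, tau=0.5):
    """Build the neuron co-activation graph and its unnormalized Laplacian.

    H   : (n, d) pooled activations, one row per input, one column per neuron
    tau : similarity cutoff (illustrative; the paper's value is not reproduced here)
    """
    # Cosine similarity between neuron activation profiles (columns of H)
    norms = np.linalg.norm(H, axis=0, keepdims=True) + 1e-12
    Hn = H / norms
    S = Hn.T @ Hn                       # (d, d) pairwise cosine similarities
    A = np.where(S >= tau, S, 0.0)      # threshold to obtain the adjacency matrix
    np.fill_diagonal(A, 0.0)            # no self-loops
    D = np.diag(A.sum(axis=1))          # diagonal degree matrix
    return D - A                        # unnormalized Laplacian L = D - A
```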
4. Runtime Safety Steering with Dual Gating
GSAEs are applied for online safety intervention with a two-stage gating system, enabling dynamic, context-dependent steering of LLM outputs.
4.1 Assembling the Spectral Vector Bank
For each decoder vector $w_k$, three scores quantify its suitability for steering:
- Spectral smoothness: derived from the Laplacian quadratic form $w_k^{\top} L\, w_k$ (normalized by $\|w_k\|^2$), so that vectors aligned with low-frequency directions of the co-activation graph score higher
- Semantic relevance: obtained via a supervised linear probe discriminating harmful from benign prompts
- Causal efficacy: the observed change in refusal probability when intervening along $w_k$
Latent directions are ranked by a weighted combination of the three scores (the mixing weights are set empirically). The top-scoring vectors comprise the "spectral vector bank" used for steering.
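A sketch of the score combination and top-k selection (the mixing weights and `k` below are placeholders; the paper's settings are not reproduced here):

```python
import numpy as np

def rank_steering_vectors(smooth, probe, causal, weights=(1/3, 1/3, 1/3), k=8):
    """Combine the three per-vector scores and select the top-k directions.

    smooth, probe, causal : (m,) arrays of spectral-smoothness, probe-relevance,
                            and causal-efficacy scores for the m decoder vectors
    weights               : mixing coefficients (illustrative placeholders)
    k                     : bank size (illustrative placeholder)
    """
    a, b, c = weights
    final = a * smooth + b * probe + c * causal   # weighted combination per vector
    top = np.argsort(final)[::-1][:k]             # indices of the k highest scores
    return top, final
```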
4.2 Input and Continuation Gating
- Input gate: A prompt is encoded to its latent code and passed to a random forest classifier yielding a harm probability. Above an upper threshold, the output is refused outright; below a lower threshold, decoding proceeds unaltered; in between, monitoring mode is entered.
- Continuation gate: At each decode step, a risk score from the classifier determines gating via hysteresis thresholds: exceeding the upper threshold for a set number of consecutive steps opens the gate (steering on), while falling below the lower threshold for a set number of consecutive steps closes it (steering off).
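The hysteresis logic of the continuation gate can be sketched as a small state machine (the thresholds and patience counts below are illustrative, not the paper's settings):

```python
class ContinuationGate:
    """Hysteresis gate over per-step risk scores (sketch; all
    thresholds and patience values are illustrative assumptions)."""

    def __init__(self, t_open=0.7, t_close=0.3, k_open=2, k_close=3):
        self.t_open, self.t_close = t_open, t_close
        self.k_open, self.k_close = k_open, k_close
        self.open = False
        self._above = 0   # consecutive steps with risk above t_open
        self._below = 0   # consecutive steps with risk below t_close

    def step(self, risk):
        # Track consecutive excursions beyond either threshold
        if risk > self.t_open:
            self._above += 1; self._below = 0
        elif risk < self.t_close:
            self._below += 1; self._above = 0
        else:
            self._above = self._below = 0
        # Open only after sustained high risk; close only after sustained low risk
        if not self.open and self._above >= self.k_open:
            self.open = True
        elif self.open and self._below >= self.k_close:
            self.open = False
        return self.open
```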
4.3 Steering Intervention
When the gate is open, the hidden state $h$ is updated as $h \leftarrow h + \delta \sum_{k \in \mathcal{K}} w_k$, where $\mathcal{K}$ is the set of top-$K$ vectors by combined score and $\delta$ is the steering strength. The intervention is applied prior to the logits projection, modifying output probabilities to enforce safety.
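A sketch of the intervention, assuming unit-normalized steering vectors and a scalar strength `delta` (both the normalization and the names are assumptions, not details from the paper):

```python
import numpy as np

def steer(h, V, delta=1.0):
    """Shift a hidden state along selected steering directions (sketch).

    h     : (d,) hidden state just before the logits projection
    V     : (k, d) top-k steering vectors drawn from the spectral vector bank
    delta : steering strength (illustrative value)
    """
    h = h.copy()
    for v in V:
        # Add each bank vector, unit-normalized, scaled by the strength
        h = h + delta * v / (np.linalg.norm(v) + 1e-12)
    return h
```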
5. Training Procedure and Hyperparameters
Key hyperparameters for empirical effectiveness:
- Latent dimension $m$
- Sparsity weight $\lambda_1$
- Graph regularization weight $\lambda_2$
- Graph similarity threshold $\tau$
- Optimizer: Adam; batch size 32
- Input-gate thresholds and continuation-gate hysteresis step counts
- Steering strength $\delta$
Training and intervention phases are modular, with three algorithmic phases—(1) GSAE model training, (2) spectral vector bank curation, (3) dual-gated steering—each summarized by precise pseudocode (Yeon et al., 7 Dec 2025).
6. Empirical Results and Comparative Performance
Table: Summary of Main Metrics (Llama-3 8B)
| Metric | GSAE | SAE steering | SafeSwitch |
|---|---|---|---|
| Selective refusal | | | |
| TriviaQA (utility) | — | — | |
| TruthfulQA (utility) | — | — | |
| GSM8K (utility) | — | — | |
| Robust harm-refusal rate | | | |
GSAE steering provides a substantial increase in selective refusal score over both standard SAE steering and SafeSwitch, while retaining strong task accuracy across QA benchmarks. On adversarial jailbreak tests (GCG, AutoDAN, TAP, and adaptive attacks), GSAE sustains a high harm-refusal rate, whereas prior methods degrade markedly (to roughly 40% in the reported range). Performance generalizes across LLaMA-3, Mistral, Qwen, and Phi model families, consistently exceeding SafeSwitch by 10–20 points.
7. Significance and Conceptual Advances
GSAEs address limitations of prior activation steering approaches, which operationalized abstract safety concepts as single-feature phenomena. Experiments confirm that GSAEs recover smooth, distributed latent representations necessary to steer for nuanced, non-localized safety attributes (e.g., refusal, temporality), enforcing adaptive refusals while minimizing detrimental effects on benign utility. The dual-gated inference mechanism, underpinned by graph-regularized autoencoding, supports real-time control over LLM outputs in both prompt and continuation phases. This suggests a general paradigm for distributed concept operationalization in safety-critical model interventions (Yeon et al., 7 Dec 2025).