
Graph-Regularized Sparse Autoencoders (GSAEs)

Updated 14 December 2025
  • Graph-Regularized Sparse Autoencoders (GSAEs) are deep models that integrate graph-based regularization and sparsity to capture distributed, safety-aligned neural representations.
  • They combine reconstruction loss, ℓ1 sparsity, and a Laplacian smoothness penalty over neuron co-activation graphs to enforce coherent latent feature extraction.
  • Empirical results demonstrate that GSAEs boost selective refusal performance and robustness against adversarial attacks, significantly outperforming previous safety steering approaches.

Graph-Regularized Sparse Autoencoders (GSAEs) are a class of neural architectures designed to recover distributed, concept-aligned representations from deep models, notably for intervening on LLMs in safety-critical contexts. By introducing a Laplacian smoothness penalty over a neuron co-activation graph, GSAEs extend traditional sparse autoencoders (SAEs) to capture safety concepts as coherent patterns spanning multiple latent features, rather than isolating them within single dimensions. Empirical evidence demonstrates that GSAEs enable state-of-the-art selective refusal performance and robustness against adversarial prompt attacks, substantially improving upon prior safety steering methods (Yeon et al., 7 Dec 2025).

1. Model Architecture

GSAEs take as input pooled hidden states $h \in \mathbb{R}^d$ extracted from selected transformer layers. The encoder is a linear transformation followed by a ReLU activation:

$$z = \mathrm{ReLU}(W^{(e)} h), \quad W^{(e)} \in \mathbb{R}^{k \times d},\; k \gg d$$

The latent code $z \in \mathbb{R}^k$ is enforced to be sparse via $\ell_1$ regularization. The decoder reconstructs the input using a linear transformation:

$$\hat h = W^{(d)} z, \quad W^{(d)} \in \mathbb{R}^{d \times k}$$

For a dataset of $N$ samples, the sets $\{h_i, z_i, \hat h_i\}_{i=1}^N$ are maintained. This architecture is designed to enable distributed feature encoding while favoring sparse, interpretable activations.
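The encoder and decoder described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gsae_forward`, the toy sizes, and the random initialization are assumptions made for the example.

```python
import numpy as np

def gsae_forward(h, W_enc, W_dec):
    """Encode pooled hidden states and reconstruct them.

    h:     (N, d) pooled hidden states, one row per sample
    W_enc: (k, d) encoder weights
    W_dec: (d, k) decoder weights
    """
    z = np.maximum(h @ W_enc.T, 0.0)  # ReLU(W_enc h) per row, shape (N, k)
    h_hat = z @ W_dec.T               # W_dec z per row, shape (N, d)
    return z, h_hat

# Toy sizes with k > d (hypothetical values, chosen only for illustration)
rng = np.random.default_rng(0)
N, d, k = 4, 8, 32
h = rng.normal(size=(N, d))
W_enc = rng.normal(size=(k, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, k)) / np.sqrt(k)

z, h_hat = gsae_forward(h, W_enc, W_dec)
print(z.shape, h_hat.shape)
```

The ReLU guarantees non-negative latent codes; sparsity itself comes from the $\ell_1$ term in the training objective rather than from the forward pass.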

2. Objective Function and Graph Regularization

GSAEs optimize a composite loss:

$$L_{\text{GSAE}} = L_{\rm recon} + \lambda\,\|Z\|_1 + \mu\,\mathrm{Tr}\left((W^{(d)})^\top L W^{(d)}\right)$$

where:

  • $L_{\rm recon} = \sum_{i=1}^N \| h_i - \hat h_i \|_2^2$ is the reconstruction error.
  • The sparsity penalty $L_{\ell_1} = \lambda\,\|Z\|_1 = \lambda \sum_{i=1}^N \| z_i \|_1$ encourages most latent activations to be zero.
  • The graph Laplacian term uses $L = D - A$, where $A$ is the adjacency matrix of the neuron co-activation graph (detail in Section 3), and $D$ is the diagonal degree matrix. This term enforces decoded features $w_j$ (columns of $W^{(d)}$) to be smooth with respect to neuron co-activations:

$$\mathrm{Tr}\left((W^{(d)})^\top L W^{(d)}\right) = \sum_{j=1}^{k} w_j^\top L\, w_j = \frac{1}{2} \sum_{j=1}^{k} \sum_{p,q=1}^{d} A_{pq}\,(w_{j,p} - w_{j,q})^2$$

The overall effect is to favor features that capture smooth, distributed structure over the co-activation manifold inferred from model activations (Yeon et al., 7 Dec 2025).
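As a rough sketch, the composite loss can be evaluated as below, together with a numerical check that the trace penalty equals a weighted sum of squared differences across graph edges when the Laplacian is the unnormalized $D - A$ of a symmetric adjacency matrix. The function names, the toy path graph, and the random decoder are assumptions for illustration only.

```python
import numpy as np

def gsae_loss(h, z, h_hat, W_dec, lap, lam, mu):
    """Composite loss: reconstruction + l1 sparsity + Laplacian smoothness."""
    recon = np.sum((h - h_hat) ** 2)
    sparsity = lam * np.sum(np.abs(z))
    smooth = mu * np.trace(W_dec.T @ lap @ W_dec)
    return recon + sparsity + smooth

def edge_form(W_dec, adj):
    """0.5 * sum_{p,q} A_pq * ||W_dec[p, :] - W_dec[q, :]||^2."""
    diff = W_dec[:, None, :] - W_dec[None, :, :]  # (d, d, k) pairwise row differences
    return 0.5 * np.sum(adj[:, :, None] * diff ** 2)

# Toy path graph on d = 3 neurons (hypothetical example)
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
lap = np.diag(adj.sum(axis=1)) - adj  # unnormalized Laplacian D - A

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(3, 5))       # d = 3 rows, k = 5 decoder columns
trace_form = np.trace(W_dec.T @ lap @ W_dec)
print(np.isclose(trace_form, edge_form(W_dec, adj)))
```

The edge form makes the role of the penalty explicit: decoder entries belonging to strongly co-activating neurons are pulled toward each other, which spreads each feature across connected neurons.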

3. Construction of the Neuron Co-Activation Graph

The co-activation graph encodes functional similarity among neurons based on their activation profiles across inputs. The construction procedure:

  • Collect pooled activations $H \in \mathbb{R}^{N \times d}$, with one row $h_i$ per input.
  • For each neuron pair $(p, q)$, compute the cosine similarity between their activation profiles (columns of $H$): $S_{pq} = \frac{H_{:,p}^\top H_{:,q}}{\|H_{:,p}\|_2\,\|H_{:,q}\|_2}$
  • Adjacency entries $A_{pq}$ are obtained by thresholding $S_{pq}$ at a threshold $\tau$, so that only sufficiently similar neuron pairs are connected.
  • The degree matrix $D$ is diagonal with $D_{pp} = \sum_{q} A_{pq}$
  • The unnormalized Laplacian $L = D - A$ is then used in the regularization term.

This process grounds concept learning in the empirical distribution of neuron co-activations, enabling the Laplacian penalty to promote structured, interpretable decompositions.
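The construction steps above might look as follows in NumPy. Several details are assumptions rather than facts from the source: the function name `coactivation_laplacian`, the example threshold of 0.5, the choice to keep similarity values as edge weights (a binary adjacency would be an equally plausible reading), and the removal of self-loops.

```python
import numpy as np

def coactivation_laplacian(H, tau):
    """Build adjacency, degree, and unnormalized Laplacian from activations.

    H:   (N, d) pooled activations, one column per neuron
    tau: similarity threshold (hypothetical value supplied by the caller)
    """
    norms = np.linalg.norm(H, axis=0)
    norms = np.where(norms == 0.0, 1.0, norms)  # avoid division by zero
    Hn = H / norms
    S = Hn.T @ Hn                               # (d, d) cosine similarities
    S = 0.5 * (S + S.T)                         # guard against floating-point asymmetry
    A = np.where(S >= tau, S, 0.0)              # assumption: keep weights above tau
    np.fill_diagonal(A, 0.0)                    # assumption: no self-loops
    D = np.diag(A.sum(axis=1))
    return A, D, D - A

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 6))
H[:, 1] = H[:, 0] + 0.1 * rng.normal(size=50)   # make neurons 0 and 1 co-activate
A, D, L = coactivation_laplacian(H, tau=0.5)
print(A[0, 1] > 0)
```

Because the retained weights are non-negative and the matrix is symmetric, the resulting Laplacian is positive semidefinite, so the trace penalty in the objective is never negative.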

4. Runtime Safety Steering with Dual Gating

GSAEs are applied for online safety intervention with a two-stage gating system, enabling dynamic, context-dependent steering of LLM outputs.

4.1 Assembling the Spectral Vector Bank

For each decoder vector $w_j$ (a column of $W^{(d)}$), three scores quantify suitability for steering:

  • Spectral smoothness: a smoothness measure of the vector over the co-activation graph, which is then converted into a score
  • Semantic relevance: a score obtained via a supervised linear probe discriminating harmful from benign prompts
  • Causal efficacy: the observed change in refusal probability when intervening along $w_j$

The three scores are combined into a final weight for each latent direction. The most salient vectors by final weight comprise the "spectral vector bank" used for steering.
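One way to picture the ranking step is sketched below. The exact scoring formulas are not reproduced here, so everything specific is an assumption: smoothness is measured with a Rayleigh quotient of the graph Laplacian, mapped to a score in (0, 1], and the three scores are merged by a plain weighted sum. The relevance and efficacy values are hypothetical inputs that would come from a probe and from intervention experiments.

```python
import numpy as np

def rank_vectors(W_dec, lap, relevance, efficacy,
                 weights=(1 / 3, 1 / 3, 1 / 3), top_k=2):
    """Rank decoder columns by a combined score (illustrative assumption)."""
    # Rayleigh quotient w^T L w / w^T w per column: lower means smoother.
    quad = np.einsum("pj,pq,qj->j", W_dec, lap, W_dec)
    energy = np.sum(W_dec ** 2, axis=0) + 1e-12
    smooth = 1.0 / (1.0 + quad / energy)   # in (0, 1], higher is smoother
    a, b, c = weights
    score = a * smooth + b * relevance + c * efficacy
    order = np.argsort(-score)[:top_k]     # indices of the highest scores
    return order, score

# Path graph on 3 neurons; column 0 is constant, column 1 oscillates.
lap = np.array([[1.0, -1.0, 0.0],
                [-1.0, 2.0, -1.0],
                [0.0, -1.0, 1.0]])
W_dec = np.array([[1.0, 1.0],
                  [1.0, -1.0],
                  [1.0, 1.0]])
relevance = np.array([0.5, 0.5])           # hypothetical probe scores
efficacy = np.array([0.5, 0.5])            # hypothetical causal scores
order, score = rank_vectors(W_dec, lap, relevance, efficacy)
print(order, score)
```

With equal relevance and efficacy, the constant column ranks first because it has zero Laplacian energy, which is the behavior a smoothness score is meant to reward.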

4.2 Input and Continuation Gating

  • Input gate: A prompt is encoded to its latent code and passed to a random forest classifier, which yields a score. When the score crosses the refusal threshold, the output is refused; when it clears the pass threshold, decoding proceeds unaltered; otherwise monitoring mode is entered.
  • Continuation gate: At each decode step, a risk score (from the classifier) determines gating via hysteresis thresholds: exceeding the opening threshold for a set number of steps opens the gate, while falling below the closing threshold for a set number of steps closes it.
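The two gates can be sketched as small pure-Python functions. The threshold values, the function names, the reading that a high score means high risk, and the requirement that the steps be consecutive are all assumptions made for this illustration.

```python
def input_gate(score, refuse_at, allow_at):
    """Three-way decision on the prompt-level score (higher = riskier, assumed)."""
    if score >= refuse_at:
        return "refuse"
    if score <= allow_at:
        return "allow"
    return "monitor"

def continuation_gate(risks, open_at, close_at, n_open, n_close):
    """Hysteresis gate over per-step risk scores; returns the gate state per step."""
    gate_open = False
    above = below = 0
    states = []
    for r in risks:
        # Count consecutive steps above the opening / below the closing threshold.
        above = above + 1 if r > open_at else 0
        below = below + 1 if r < close_at else 0
        if not gate_open and above >= n_open:
            gate_open = True
        elif gate_open and below >= n_close:
            gate_open = False
        states.append(gate_open)
    return states

risks = [0.2, 0.8, 0.9, 0.5, 0.1, 0.1]  # hypothetical per-step risk scores
states = continuation_gate(risks, open_at=0.7, close_at=0.3, n_open=2, n_close=2)
print(states)  # [False, False, True, True, True, False]
```

Using separate opening and closing thresholds keeps the gate from flickering when the risk score hovers around a single cutoff.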

4.3 Steering Intervention

When the gate is open, the hidden state is updated using the set of top-ranked vectors from the spectral vector bank, selected by their final weights, with a steering strength set as a hyperparameter (Section 5). The intervention is applied prior to logits projection, modifying output probabilities to enforce safety.
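A gated hidden-state update of this kind could be sketched as follows. The additive form, the use of the final weights as mixing coefficients, and the names `steer`, `alpha`, and `selected` are assumptions for illustration; the source's exact update rule is not reproduced here.

```python
import numpy as np

def steer(h, W_dec, selected, weights, alpha, gate_open):
    """Shift a hidden state along selected decoder vectors when the gate is open.

    h:        (d,) hidden state before the logits projection
    W_dec:    (d, k) decoder matrix whose columns are candidate directions
    selected: indices of the top-ranked vectors (hypothetical bank)
    weights:  (k,) final weight per vector
    alpha:    steering strength
    """
    if not gate_open:
        return h
    direction = W_dec[:, selected] @ weights[selected]  # weighted sum, shape (d,)
    return h + alpha * direction

W_dec = np.eye(3)                      # toy decoder with axis-aligned directions
weights = np.array([0.5, 0.1, 0.25])   # hypothetical final weights
h = np.zeros(3)
steered = steer(h, W_dec, [0, 2], weights, alpha=2.0, gate_open=True)
print(steered)
```

When the gate is closed the function returns the hidden state unchanged, so benign continuations pass through without modification.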

5. Training Procedure and Hyperparameters

The key hyperparameters are:

  • Latent dimension $k$
  • Sparsity weight $\lambda$
  • Graph regularization weight $\mu$
  • Graph threshold for the co-activation adjacency (Section 3)
  • Optimizer: Adam, with its learning rate, batch size 32, and number of iterations
  • Gate thresholds and hysteresis step counts (Section 4.2)
  • Steering strength

Training and intervention are modular and split into three algorithmic phases: (1) GSAE model training, (2) spectral vector bank curation, and (3) dual-gated steering. Each phase is summarized by pseudocode (Yeon et al., 7 Dec 2025).

6. Empirical Results and Comparative Performance

Table: Summary of Main Metrics (Llama-3 8B). Methods compared: GSAE, SAE steering, SafeSwitch. Metrics reported: selective refusal, TriviaQA (utility), TruthfulQA (utility), GSM8K (utility), and robust harm-refusal rate.

GSAE steering provides a substantial increase in selective refusal score over both standard SAE steering and SafeSwitch, while retaining strong task accuracy on the TriviaQA, TruthfulQA, and GSM8K utility benchmarks. On adversarial jailbreak tests (GCG, AutoDAN, TAP, adaptive), GSAE sustains its harm-refusal rate, whereas prior methods degrade. Performance generalizes across LLaMA-3, Mistral, Qwen, and Phi model families, consistently exceeding SafeSwitch.

7. Significance and Conceptual Advances

GSAEs address limitations of prior activation steering approaches, which operationalized abstract safety concepts as single-feature phenomena. Experiments confirm that GSAEs recover smooth, distributed latent representations necessary to steer for nuanced, non-localized safety attributes (e.g., refusal, temporality), enforcing adaptive refusals while minimizing detrimental effects on benign utility. The dual-gated inference mechanism, underpinned by graph-regularized autoencoding, supports real-time control over LLM outputs in both prompt and continuation phases. This suggests a general paradigm for distributed concept operationalization in safety-critical model interventions (Yeon et al., 7 Dec 2025).
