
Graph-Regularized Sparse Autoencoders (GSAEs)

Updated 14 December 2025
  • Graph-Regularized Sparse Autoencoders (GSAEs) are deep models that integrate graph-based regularization and sparsity to capture distributed, safety-aligned neural representations.
  • They combine reconstruction loss, ℓ1 sparsity, and a Laplacian smoothness penalty over neuron co-activation graphs to enforce coherent latent feature extraction.
  • Empirical results demonstrate that GSAEs boost selective refusal performance and robustness against adversarial attacks, significantly outperforming previous safety steering approaches.

Graph-Regularized Sparse Autoencoders (GSAEs) are a class of neural architectures designed to recover distributed, concept-aligned representations from deep models, notably for intervening on LLMs in safety-critical contexts. By introducing a Laplacian smoothness penalty over a neuron co-activation graph, GSAEs extend traditional sparse autoencoders (SAEs) to capture safety concepts as coherent patterns spanning multiple latent features, rather than isolating them within single dimensions. Empirical evidence demonstrates that GSAEs enable state-of-the-art selective refusal performance and robustness against adversarial prompt attacks, substantially improving upon prior safety steering methods (Yeon et al., 7 Dec 2025).

1. Model Architecture

GSAEs take as input pooled hidden states $h \in \mathbb{R}^d$ extracted from selected transformer layers. The encoder is a linear transformation followed by a ReLU activation:

$$z = \mathrm{ReLU}(W^{(e)} h), \quad W^{(e)} \in \mathbb{R}^{k \times d},\; k \gg d$$

The latent code $z \in \mathbb{R}^k$ is made sparse via $\ell_1$ regularization. The decoder reconstructs the input with a linear map:

$$\hat h = W^{(d)} z, \quad W^{(d)} \in \mathbb{R}^{d \times k}$$

For a dataset of $N$ samples, the triples $\{h_i, z_i, \hat h_i\}_{i=1}^N$ are maintained. This architecture enables distributed feature encoding while favoring sparse, interpretable activations.
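
The encoder/decoder pass above can be sketched in a few lines of NumPy. Toy dimensions and the random initialization are illustrative only; the paper uses $k = 16d$ and trained weights.

```python
import numpy as np

def gsae_forward(h, W_e, W_d):
    """One GSAE forward pass: ReLU encoder, linear decoder.

    h   : (d,) pooled hidden state
    W_e : (k, d) encoder weights, with k >> d (overcomplete latent)
    W_d : (d, k) decoder weights
    Returns the nonnegative latent code z and the reconstruction h_hat.
    """
    z = np.maximum(W_e @ h, 0.0)   # ReLU keeps z nonnegative (sparse after l1 training)
    h_hat = W_d @ z                # linear reconstruction
    return z, h_hat

# Toy sizes for illustration (the paper's k = 16d would give k = 128 here)
rng = np.random.default_rng(0)
d, k = 8, 32
W_e = rng.normal(scale=0.1, size=(k, d))
W_d = rng.normal(scale=0.1, size=(d, k))
h = rng.normal(size=d)
z, h_hat = gsae_forward(h, W_e, W_d)
```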

2. Objective Function and Graph Regularization

GSAEs optimize a composite loss:

$$L_{\text{GSAE}} = L_{\mathrm{recon}} + \lambda\,\|Z\|_1 + \mu\,\mathrm{Tr}\!\left((W^{(d)})^\top L\, W^{(d)}\right)$$

where:

  • $L_{\mathrm{recon}} = \sum_{i=1}^N \| h_i - \hat h_i \|_2^2$ is the reconstruction error.
  • The sparsity penalty $L_{\ell_1} = \lambda \sum_{i=1}^N \| z_i \|_1$ encourages most latent activations to be zero.
  • The graph Laplacian term uses $L = D - A$, where $A$ is the adjacency matrix of the neuron co-activation graph (detailed in Section 3) and $D$ is the diagonal degree matrix. It enforces the decoded features $v_j$ (columns of $W^{(d)}$) to be smooth with respect to neuron co-activations:

$$L_{\mathrm{graph}} = \mu \sum_{j=1}^k v_j^\top L v_j = \mu\, \mathrm{Tr}\!\left((W^{(d)})^\top L\, W^{(d)}\right)$$

The overall effect is to favor features that capture smooth, distributed structure over the co-activation manifold inferred from model activations (Yeon et al., 7 Dec 2025).
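
A minimal NumPy sketch of the composite objective, written directly from the three terms above (the function name and batch layout are illustrative, not from the paper's code):

```python
import numpy as np

def gsae_loss(H, H_hat, Z, W_d, L, lam=1e-4, mu=1e-3):
    """Composite GSAE objective on a batch.

    H, H_hat : (N, d) inputs and reconstructions
    Z        : (N, k) latent codes
    W_d      : (d, k) decoder; its columns v_j are the decoded features
    L        : (d, d) graph Laplacian over neurons
    lam, mu  : the paper's reported sparsity / graph weights as defaults
    """
    recon = np.sum((H - H_hat) ** 2)        # sum of squared reconstruction errors
    sparsity = lam * np.sum(np.abs(Z))      # l1 penalty on latent activations
    graph = mu * np.trace(W_d.T @ L @ W_d)  # Tr((W_d)^T L W_d) smoothness term
    return recon + sparsity + graph
```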

3. Construction of the Neuron Co-Activation Graph

The co-activation graph $G = (V, E)$ encodes functional similarity among neurons based on their activation profiles across inputs. The construction procedure:

  • Collect pooled activations $H \in \mathbb{R}^{d \times N}$.
  • For each neuron pair, compute the cosine similarity $s_{ij} = \dfrac{\langle h_{i,:}, h_{j,:} \rangle}{\|h_{i,:}\|_2\,\|h_{j,:}\|_2}$.
  • Threshold adjacency entries at $T$ (e.g., $T = 0.6$):

$$A_{ij} = \begin{cases} s_{ij} & \text{if } s_{ij} > T \\ 0 & \text{otherwise} \end{cases}$$

  • The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$.
  • The unnormalized Laplacian $L = D - A$ is then used in the regularization term.

This process grounds concept learning in the empirical distribution of neuron co-activations, enabling the Laplacian penalty to promote structured, interpretable decompositions.
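
The construction steps above can be sketched as follows. Dropping self-loops is a common convention that the description does not spell out, so it is flagged as an assumption in the code:

```python
import numpy as np

def coactivation_laplacian(H, T=0.6):
    """Build the unnormalized Laplacian L = D - A from activations.

    H : (d, N) pooled activations, one row per neuron
    T : cosine-similarity threshold for keeping an edge (paper uses 0.6)
    """
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against all-zero (dead) neurons
    Hn = H / norms
    S = Hn @ Hn.T                    # pairwise cosine similarities s_ij
    A = np.where(S > T, S, 0.0)      # threshold at T
    np.fill_diagonal(A, 0.0)         # drop self-loops (assumption, not stated)
    D = np.diag(A.sum(axis=1))       # diagonal degree matrix
    return D - A
```

Because each row of $D - A$ sums to zero by construction, the constant vector is always in the Laplacian's null space, which is a quick sanity check after building the graph.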

4. Runtime Safety Steering with Dual Gating

GSAEs are applied for online safety intervention with a two-stage gating system, enabling dynamic, context-dependent steering of LLM outputs.

4.1 Assembling the Spectral Vector Bank

For each decoder vector $v_i$, three scores quantify its suitability for steering:

  • Spectral smoothness: $E_i = v_i^\top L v_i / \|v_i\|_2^2$, then $s^{\mathrm{lap}}_i = \exp(-\beta E_i)$
  • Semantic relevance: $s^{\mathrm{sup}}_i = |\theta_i|$, from a supervised linear probe that discriminates harmful from benign prompts
  • Causal efficacy: $s^{\mathrm{infl}}_i$, the observed change in refusal probability when intervening along $v_i$

The latent directions are combined with final weights

$$w_i = \frac{(s^{\mathrm{lap}}_i)^\alpha (s^{\mathrm{sup}}_i)^\beta (s^{\mathrm{infl}}_i)^\gamma}{\sum_j (s^{\mathrm{lap}}_j)^\alpha (s^{\mathrm{sup}}_j)^\beta (s^{\mathrm{infl}}_j)^\gamma}$$

with $\alpha = \beta = \gamma = 1$ in experiments. The most salient vectors by $w_i$ form the "spectral vector bank" used for steering.
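
The weight combination is a straightforward normalized product of the three score vectors; a sketch (function name illustrative):

```python
import numpy as np

def bank_weights(s_lap, s_sup, s_infl, alpha=1.0, beta=1.0, gamma=1.0):
    """Normalized geometric combination of the three per-feature scores.

    s_lap, s_sup, s_infl : (k,) nonnegative score arrays
    Returns weights w that sum to 1; exponents default to the paper's
    setting alpha = beta = gamma = 1.
    """
    raw = (s_lap ** alpha) * (s_sup ** beta) * (s_infl ** gamma)
    return raw / raw.sum()
```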

4.2 Input and Continuation Gating

  • Input gate: a prompt is encoded to $z$ and passed to a random forest classifier $g$ that outputs $\Pr_{\mathrm{harm}}$. If $\Pr_{\mathrm{harm}} > t_{\mathrm{high}}$, the output is refused; if $\Pr_{\mathrm{harm}} < t_{\mathrm{low}}$, decoding proceeds unaltered; otherwise the model enters monitoring mode.
  • Continuation gate: at each decode step, a risk score $r_t$ (from the classifier) determines gating via hysteresis thresholds: exceeding $d_{\mathrm{high}}$ for $S_\uparrow$ consecutive steps opens the gate ($y_t = 1$), while falling below $d_{\mathrm{low}}$ for $S_\downarrow$ consecutive steps closes it ($y_t = 0$).
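
The continuation gate's hysteresis logic can be sketched as a small state machine. The class and its run counters are illustrative; the default thresholds mirror the values reported in Section 5, and the mapping of the hysteresis pair $(2, 3)$ to open/close counts is an assumption:

```python
class HysteresisGate:
    """Continuation gate: opens after s_up consecutive steps with r_t above
    d_high, closes after s_down consecutive steps with r_t below d_low."""

    def __init__(self, d_low=0.7, d_high=0.9, s_up=2, s_down=3):
        self.d_low, self.d_high = d_low, d_high
        self.s_up, self.s_down = s_up, s_down
        self.open = False
        self.hi_run = 0   # consecutive high-risk steps
        self.lo_run = 0   # consecutive low-risk steps

    def step(self, r_t):
        self.hi_run = self.hi_run + 1 if r_t > self.d_high else 0
        self.lo_run = self.lo_run + 1 if r_t < self.d_low else 0
        if not self.open and self.hi_run >= self.s_up:
            self.open = True
        elif self.open and self.lo_run >= self.s_down:
            self.open = False
        return int(self.open)   # y_t
```

The hysteresis band ($d_{\mathrm{low}} < d_{\mathrm{high}}$) prevents the gate from flapping when the risk score hovers near a single threshold.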

4.3 Steering Intervention

When the gate is open, the hidden state is updated by

$$\Delta h_t = \alpha_0 \sum_{i \in \mathcal{S}} w_i \cos(h_t, v_i)\, v_i$$

where $\mathcal{S}$ is the set of top-$m$ vectors by $w_i$. The intervention $h_t \leftarrow h_t - \Delta h_t$ is applied before the logits projection, modifying output probabilities to enforce safety.
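
A sketch of the update, assuming the decoder vectors are stacked as rows of a matrix (the function signature and the small epsilon guard are illustrative):

```python
import numpy as np

def steer(h_t, V, w, alpha0=2.5, m=4):
    """Apply the gated steering update h_t <- h_t - delta_h.

    h_t : (d,) hidden state at the current decode step
    V   : (k, d) decoder vectors v_i as rows
    w   : (k,) spectral-bank weights; the top-m by weight form the set S
    """
    top = np.argsort(w)[-m:]                  # indices of the top-m weights
    delta = np.zeros_like(h_t)
    hn = np.linalg.norm(h_t)
    for i in top:
        v = V[i]
        # cosine similarity between the hidden state and the feature direction
        cos = h_t @ v / (hn * np.linalg.norm(v) + 1e-12)
        delta += alpha0 * w[i] * cos * v
    return h_t - delta
```

The cosine factor scales the subtraction by how strongly the current hidden state already aligns with each safety direction, so orthogonal states are left untouched.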

5. Training Procedure and Hyperparameters

Key hyperparameters for empirical effectiveness:

  • Latent dimension $k = 16d$
  • Sparsity weight $\lambda = 1 \times 10^{-4}$
  • Graph regularization weight $\mu = 1 \times 10^{-3}$
  • Graph threshold $T = 0.6$
  • Optimizer: Adam, learning rate $1 \times 10^{-3}$, batch size 32, $1 \times 10^5$ iterations
  • Gate thresholds: $(t_{\mathrm{low}}, t_{\mathrm{high}}) = (0.30, 0.65)$; $(d_{\mathrm{low}}, d_{\mathrm{high}}) = (0.7, 0.9)$; hysteresis steps $(2, 3)$
  • Steering strength $\alpha_0 = 2.5$
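
For reference, the reported settings collected into one place (the function and key names are illustrative; only the values come from the paper):

```python
def gsae_config(d):
    """Reported GSAE hyperparameters as a dict; d is the model hidden size."""
    return {
        "k": 16 * d,                 # latent dimension
        "lambda_sparsity": 1e-4,     # l1 weight
        "mu_graph": 1e-3,            # Laplacian weight
        "graph_threshold": 0.6,      # cosine threshold T
        "lr": 1e-3,                  # Adam learning rate
        "batch_size": 32,
        "iterations": 100_000,
        "gate_t": (0.30, 0.65),      # (t_low, t_high) input gate
        "gate_d": (0.7, 0.9),        # (d_low, d_high) continuation gate
        "hysteresis_steps": (2, 3),
        "alpha0": 2.5,               # steering strength
    }
```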

Training and intervention phases are modular, with three algorithmic phases—(1) GSAE model training, (2) spectral vector bank curation, (3) dual-gated steering—each summarized by precise pseudocode (Yeon et al., 7 Dec 2025).

6. Empirical Results and Comparative Performance

Table: Summary of Main Metrics (Llama-3 8B)

| Metric | GSAE | SAE steering | SafeSwitch |
|---|---|---|---|
| Selective refusal $A_s$ | 82% | 42% | 58% |
| TriviaQA (utility) | 70.0% | | |
| TruthfulQA (utility) | 65.4% | | |
| GSM8K (utility) | 74.2% | | |
| Robust harm-refusal rate | $\geq 90\%$ | 40–70% | 40–70% |

GSAE steering provides a substantial increase in selective refusal score ($A_s$) over both standard SAE steering and SafeSwitch, while retaining strong task accuracy across QA benchmarks. On adversarial jailbreak tests (GCG, AutoDAN, TAP, adaptive), GSAE sustains a harm-refusal rate $\geq 90\%$, whereas prior methods degrade to 40–70%. Performance generalizes across LLaMA-3, Mistral, Qwen, and Phi model families, consistently exceeding SafeSwitch by 10–20 points in $A_s$.

7. Significance and Conceptual Advances

GSAEs address limitations of prior activation steering approaches, which operationalized abstract safety concepts as single-feature phenomena. Experiments confirm that GSAEs recover smooth, distributed latent representations necessary to steer for nuanced, non-localized safety attributes (e.g., refusal, temporality), enforcing adaptive refusals while minimizing detrimental effects on benign utility. The dual-gated inference mechanism, underpinned by graph-regularized autoencoding, supports real-time control over LLM outputs in both prompt and continuation phases. This suggests a general paradigm for distributed concept operationalization in safety-critical model interventions (Yeon et al., 7 Dec 2025).
