Graph-Regularized Sparse Autoencoders (GSAEs)
- Graph-Regularized Sparse Autoencoders (GSAEs) are sparse autoencoders augmented with graph-based regularization, trained on deep-model activations to capture distributed, safety-aligned neural representations.
- They combine reconstruction loss, ℓ1 sparsity, and a Laplacian smoothness penalty over neuron co-activation graphs to enforce coherent latent feature extraction.
- Empirical results demonstrate that GSAEs boost selective refusal performance and robustness against adversarial attacks, significantly outperforming previous safety steering approaches.
Graph-Regularized Sparse Autoencoders (GSAEs) are a class of neural architectures designed to recover distributed, concept-aligned representations from deep models, notably for intervening on LLMs in safety-critical contexts. By introducing a Laplacian smoothness penalty over a neuron co-activation graph, GSAEs extend traditional sparse autoencoders (SAEs) to capture safety concepts as coherent patterns spanning multiple latent features, rather than isolating them within single dimensions. Empirical evidence demonstrates that GSAEs enable state-of-the-art selective refusal performance and robustness against adversarial prompt attacks, substantially improving upon prior safety steering methods (Yeon et al., 7 Dec 2025).
1. Model Architecture
GSAEs take as input pooled hidden states $x \in \mathbb{R}^d$ extracted from selected transformer layers. The encoder is a linear transformation followed by a ReLU activation, $z = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}})$, and the latent code $z \in \mathbb{R}^m$ is enforced to be sparse via regularization. The decoder reconstructs the input with a linear transformation, $\hat{x} = W_{\mathrm{dec}} z + b_{\mathrm{dec}}$. For a dataset of $n$ samples, the corresponding sets of codes $\{z_i\}$ and reconstructions $\{\hat{x}_i\}$ are maintained. This architecture enables distributed feature encoding while favoring sparse, interpretable activations.
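The encoder–decoder pair can be sketched in a few lines of NumPy (the dimensions, initialization, and variable names here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 16, 64                                # input (hidden-state) dim, latent dim (illustrative)
W_enc = rng.normal(scale=0.1, size=(m, d))   # encoder weights (hypothetical init)
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(d, m))   # decoder weights
b_dec = np.zeros(d)

def encode(x):
    # Linear map followed by ReLU yields a non-negative latent code z
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(z):
    # Linear reconstruction of the pooled hidden state
    return W_dec @ z + b_dec

x = rng.normal(size=d)   # stand-in for a pooled hidden state from a transformer layer
z = encode(x)
x_hat = decode(z)
```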
2. Objective Function and Graph Regularization
GSAEs optimize a composite loss of the form

$$\mathcal{L} = \sum_{i} \|x_i - \hat{x}_i\|_2^2 + \lambda_1 \sum_{i} \|z_i\|_1 + \lambda_2\, \mathrm{tr}\!\left(W_{\mathrm{dec}}^{\top} L\, W_{\mathrm{dec}}\right),$$

where:
- $\|x_i - \hat{x}_i\|_2^2$ is the reconstruction error.
- The sparsity penalty $\lambda_1 \|z_i\|_1$ encourages most latent activations to be zero.
- The graph Laplacian term uses $L = D - A$, where $A$ is the adjacency matrix of the neuron co-activation graph (detailed in Section 3) and $D$ is the diagonal degree matrix. This term enforces the decoded features (columns $w_k$ of $W_{\mathrm{dec}}$) to be smooth with respect to neuron co-activations, since $\mathrm{tr}(W_{\mathrm{dec}}^{\top} L\, W_{\mathrm{dec}}) = \tfrac{1}{2}\sum_{k}\sum_{i,j} A_{ij}\,(w_k[i] - w_k[j])^2$. The overall effect is to favor features that capture smooth, distributed structure over the co-activation manifold inferred from model activations (Yeon et al., 7 Dec 2025).
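A minimal sketch of the composite objective, assuming the standard form of each term (the penalty weights `lam1`/`lam2` are placeholders, not the paper's values):

```python
import numpy as np

def gsae_loss(X, X_hat, Z, W_dec, L, lam1=1e-3, lam2=1e-3):
    """Composite GSAE objective (sketch; lam1/lam2 are illustrative weights).

    X, X_hat : (n, d) inputs and reconstructions
    Z        : (n, m) latent codes
    W_dec    : (d, m) decoder; each column is a feature direction over d neurons
    L        : (d, d) unnormalized graph Laplacian, L = D - A
    """
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # reconstruction error
    sparsity = np.mean(np.sum(np.abs(Z), axis=1))       # l1 penalty on activations
    # tr(W^T L W) penalizes decoder columns that vary sharply across
    # strongly co-activating neurons (smoothness over the graph)
    smooth = np.trace(W_dec.T @ L @ W_dec)
    return recon + lam1 * sparsity + lam2 * smooth
```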
3. Construction of the Neuron Co-Activation Graph
The co-activation graph encodes functional similarity among neurons based on their activation profiles across inputs. The construction procedure:
- Collect pooled activations $h_1, \dots, h_n \in \mathbb{R}^d$ over a corpus of inputs; let $a_i \in \mathbb{R}^n$ denote neuron $i$'s activation profile across the corpus.
- For each neuron pair $(i, j)$, compute the cosine similarity $s_{ij} = \langle a_i, a_j \rangle / (\|a_i\|\,\|a_j\|)$.
- Adjacency entries are thresholded at a similarity cutoff $\tau$: $A_{ij} = s_{ij}$ if $s_{ij} \geq \tau$, and $A_{ij} = 0$ otherwise.
- The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$.
- The unnormalized Laplacian $L = D - A$ is then used in the regularization term.
This process grounds concept learning in the empirical distribution of neuron co-activations, enabling the Laplacian penalty to promote structured, interpretable decompositions.
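The construction steps above translate directly into NumPy (a sketch; the threshold value used here is illustrative):

```python
import numpy as np

def coactivation_laplacian(H, tau=0.5):
    """Build the neuron co-activation graph and its unnormalized Laplacian.

    H   : (n, d) pooled activations, one row per input, one column per neuron
    tau : similarity cutoff (illustrative; the paper's value is not reproduced here)
    """
    # Cosine similarity between neuron activation profiles (columns of H)
    norms = np.linalg.norm(H, axis=0, keepdims=True) + 1e-12
    Hn = H / norms
    S = Hn.T @ Hn                       # (d, d) pairwise cosine similarities
    A = np.where(S >= tau, S, 0.0)      # threshold to obtain the adjacency matrix
    np.fill_diagonal(A, 0.0)            # no self-loops
    D = np.diag(A.sum(axis=1))          # diagonal degree matrix
    return D - A                        # unnormalized Laplacian L = D - A
```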
4. Runtime Safety Steering with Dual Gating
GSAEs are applied for online safety intervention with a two-stage gating system, enabling dynamic, context-dependent steering of LLM outputs.
4.1 Assembling the Spectral Vector Bank
For each decoder vector $w_k$, three scores quantify its suitability for steering:
- Spectral smoothness: derived from the Laplacian quadratic form $w_k^{\top} L\, w_k$ (normalized by $\|w_k\|^2$), so that vectors aligned with low-frequency directions of the co-activation graph score higher
- Semantic relevance: obtained via a supervised linear probe discriminating harmful from benign prompts
- Causal efficacy: the observed change in refusal probability when intervening along $w_k$
Latent directions are ranked by a weighted combination of the three scores (the mixing weights are set empirically). The top-scoring vectors comprise the "spectral vector bank" used for steering.
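A sketch of the score combination and top-k selection (the mixing weights and `k` below are placeholders; the paper's settings are not reproduced here):

```python
import numpy as np

def rank_steering_vectors(smooth, probe, causal, weights=(1/3, 1/3, 1/3), k=8):
    """Combine the three per-vector scores and select the top-k directions.

    smooth, probe, causal : (m,) arrays of spectral-smoothness, probe-relevance,
                            and causal-efficacy scores for the m decoder vectors
    weights               : mixing coefficients (illustrative placeholders)
    k                     : bank size (illustrative placeholder)
    """
    a, b, c = weights
    final = a * smooth + b * probe + c * causal   # weighted combination per vector
    top = np.argsort(final)[::-1][:k]             # indices of the k highest scores
    return top, final
```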
4.2 Input and Continuation Gating
- Input gate: A prompt is encoded to its latent code and passed to a random forest classifier yielding a harm probability. Above an upper threshold, the output is refused outright; below a lower threshold, decoding proceeds unaltered; in between, monitoring mode is entered.
- Continuation gate: At each decode step, a risk score from the classifier determines gating via hysteresis thresholds: exceeding the upper threshold for a set number of consecutive steps opens the gate (steering on), while falling below the lower threshold for a set number of consecutive steps closes it (steering off).
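The hysteresis logic of the continuation gate can be sketched as a small state machine (the thresholds and patience counts below are illustrative, not the paper's settings):

```python
class ContinuationGate:
    """Hysteresis gate over per-step risk scores (sketch; all
    thresholds and patience values are illustrative assumptions)."""

    def __init__(self, t_open=0.7, t_close=0.3, k_open=2, k_close=3):
        self.t_open, self.t_close = t_open, t_close
        self.k_open, self.k_close = k_open, k_close
        self.open = False
        self._above = 0   # consecutive steps with risk above t_open
        self._below = 0   # consecutive steps with risk below t_close

    def step(self, risk):
        # Track consecutive excursions beyond either threshold
        if risk > self.t_open:
            self._above += 1; self._below = 0
        elif risk < self.t_close:
            self._below += 1; self._above = 0
        else:
            self._above = self._below = 0
        # Open only after sustained high risk; close only after sustained low risk
        if not self.open and self._above >= self.k_open:
            self.open = True
        elif self.open and self._below >= self.k_close:
            self.open = False
        return self.open
```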
4.3 Steering Intervention
When the gate is open, the hidden state $h$ is updated as $h \leftarrow h + \delta \sum_{k \in \mathcal{K}} w_k$, where $\mathcal{K}$ is the set of top-$K$ vectors by combined score and $\delta$ is the steering strength. The intervention is applied prior to the logits projection, modifying output probabilities to enforce safety.
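A sketch of the intervention, assuming unit-normalized steering vectors and a scalar strength `delta` (both the normalization and the names are assumptions, not details from the paper):

```python
import numpy as np

def steer(h, V, delta=1.0):
    """Shift a hidden state along selected steering directions (sketch).

    h     : (d,) hidden state just before the logits projection
    V     : (k, d) top-k steering vectors drawn from the spectral vector bank
    delta : steering strength (illustrative value)
    """
    h = h.copy()
    for v in V:
        # Add each bank vector, unit-normalized, scaled by the strength
        h = h + delta * v / (np.linalg.norm(v) + 1e-12)
    return h
```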
5. Training Procedure and Hyperparameters
Key hyperparameters for empirical effectiveness:
- Latent dimension $m$
- Sparsity weight $\lambda_1$
- Graph regularization weight $\lambda_2$
- Graph similarity threshold $\tau$
- Optimizer: Adam; batch size 32
- Input-gate thresholds and continuation-gate hysteresis step counts
- Steering strength $\delta$
Training and intervention phases are modular, with three algorithmic phases—(1) GSAE model training, (2) spectral vector bank curation, (3) dual-gated steering—each summarized by precise pseudocode (Yeon et al., 7 Dec 2025).
6. Empirical Results and Comparative Performance
Table: Summary of Main Metrics (Llama-3 8B)
| Metric | GSAE | SAE steering | SafeSwitch |
|---|---|---|---|
| Selective refusal | | | |
| TriviaQA (utility) | — | — | |
| TruthfulQA (utility) | — | — | |
| GSM8K (utility) | — | — | |
| Robust harm-refusal rate | | | |
GSAE steering provides a substantial increase in selective refusal score over both standard SAE steering and SafeSwitch, while retaining strong task accuracy across QA benchmarks. On adversarial jailbreak tests (GCG, AutoDAN, TAP, and adaptive attacks), GSAE sustains a high harm-refusal rate, whereas prior methods degrade markedly (to roughly 40% in the reported range). Performance generalizes across LLaMA-3, Mistral, Qwen, and Phi model families, consistently exceeding SafeSwitch by 10–20 points.
7. Significance and Conceptual Advances
GSAEs address limitations of prior activation steering approaches, which operationalized abstract safety concepts as single-feature phenomena. Experiments confirm that GSAEs recover smooth, distributed latent representations necessary to steer for nuanced, non-localized safety attributes (e.g., refusal, temporality), enforcing adaptive refusals while minimizing detrimental effects on benign utility. The dual-gated inference mechanism, underpinned by graph-regularized autoencoding, supports real-time control over LLM outputs in both prompt and continuation phases. This suggests a general paradigm for distributed concept operationalization in safety-critical model interventions (Yeon et al., 7 Dec 2025).