YaPO: Sparse Policy Optimization
- YaPO is a reference-free method that employs learnable sparse activation vectors via a Sparse Autoencoder for fine-grained control in LLM alignment.
- It overcomes the entanglement of dense steering approaches, yielding rapid convergence, stable learning, and robust domain adaptation.
- Empirical evaluations reveal improved MCQ accuracy, minimized localization gaps, and preserved general knowledge compared to dense methods.
Yet Another Policy Optimization (YaPO) is a reference-free method for steering LLMs using learnable sparse activation steering vectors in the latent space of a Sparse Autoencoder (SAE). Developed to address the entanglement and instability inherent in dense activation steering methods, YaPO enables fine-grained alignment and domain adaptation—especially in settings where precise distinctions among closely related behaviors or values (e.g., cultural alignment) are required. By optimizing sparse, interpretable codes within an SAE basis, YaPO facilitates stable and efficient learning of steering directions, outperforming dense preference optimization approaches in convergence speed, robustness, and overall performance without compromising the model’s general knowledge (Bounhar et al., 13 Jan 2026).
1. Domain Adaptation Challenge and Motivation
Domain adaptation and alignment in LLMs seek to guide model behaviors—such as adherence to cultural norms, truthfulness, or safety—without the computational or data demands of full model fine-tuning. Traditional dense steering methods, exemplified by Bi-directional Preference Optimization (BiPO), apply a single learned vector to the residual stream of an LLM at a chosen layer, trained via preference-based DPO objectives. However, LLM neurons exhibit multi-semanticity (superposition), with each neuron typically encoding multiple latent factors, causing dense vectors to entangle disparate behaviors and undermining effective, stable, and fine-grained control.
Sparse approaches leverage SAEs trained on hidden activations to decompose them into high-cardinality, approximately monosemantic and nonnegative features. Steering in this sparse latent space allows the optimization of directions that are (i) more interpretable, (ii) less interfering across unrelated behaviors, and (iii) conducive to faster, more stable convergence due to enhanced gradient properties (Bounhar et al., 13 Jan 2026).
2. Model Architecture: Sparse Autoencoder and Intervention Mechanism
Let h = A_L(x) denote the hidden state at transformer layer L for input x, with the remaining upper layers and output softmax mapping h to next-token probabilities. YaPO operates by introducing a sparse, trainable code v into the latent space defined by a pretrained and frozen SAE with encoder Enc and decoder Dec.
Key architectural features include:
- Nonnegativity: The steered codes pass through a ReLU, enforcing s̃ = ReLU(s + d·v) ≥ 0.
- Sparsity: SAE training applies ℓ1 or KL-divergence penalties to encourage sparsity in code activations.
- Residual Correction: To correct for imperfect SAE reconstruction, the steered hidden state is defined as h' = Dec(ReLU(Enc(h) + d·v)) + (h − Dec(Enc(h))).
When v = 0, h' recovers h exactly, since SAE codes are already nonnegative and the two reconstructions cancel. The vector v is the only quantity updated during steering optimization; the parameters of the SAE and LLM remain frozen.
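The intervention can be sketched with a toy encoder/decoder (the weights below are illustrative, not a trained SAE; only the ReLU/residual-correction structure is from the source):

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

# Toy "SAE": Enc lifts a 2-dim hidden state into 3 nonnegative codes,
# Dec maps codes back. Weights are illustrative, not a trained SAE.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_DEC = [[0.5, 0.0, 0.25], [0.0, 0.5, 0.25]]

def enc(h):
    return relu([sum(w * x for w, x in zip(row, h)) for row in W_ENC])

def dec(s):
    return [sum(w * c for w, c in zip(row, s)) for row in W_DEC]

def steer(h, v, d=1):
    """YaPO-style intervention: decode the steered sparse code, then add
    back the SAE reconstruction residual so that v = 0 leaves h unchanged."""
    s_tilde = relu([c + d * vi for c, vi in zip(enc(h), v)])
    h_hat = dec(s_tilde)      # steered reconstruction
    h_bar = dec(enc(h))       # plain reconstruction
    return [hh + (x - hb) for hh, x, hb in zip(h_hat, h, h_bar)]

h = [0.8, -0.3]
# Identity when v = 0 (up to floating-point rounding):
assert all(abs(a - b) < 1e-12 for a, b in zip(steer(h, [0.0, 0.0, 0.0]), h))
```

Note how the residual term (h − Dec(Enc(h))) guarantees exact identity behavior at v = 0 even though the toy SAE reconstructs imperfectly.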
3. Optimization Objective
The main optimization goal is to maximize preference alignment using a DPO-style objective adapted for sparse codes. For BiPO (the dense baseline), the preference loss is computed over outputs produced from hidden states steered by a dense vector added directly to the residual stream; for YaPO, the same style of loss is computed over outputs produced from the residually corrected state h', steered through the sparse code v in the SAE latent space. A regularization term may be included: an ℓ1 penalty further enforces sparsity on v, and a reconstruction penalty maintains fidelity (in practice typically negligible).
Backpropagation is performed only on v, with both SAE and LLM weights fixed.
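As a rough illustration, a generic reference-free DPO-style preference term with a direction flag d might look like the following (a sketch of the loss family, not the paper's exact objective):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(logp_chosen, logp_rejected, d=1, beta=1.0):
    """Reference-free DPO-style loss on sequence log-probs computed under the
    steered model; d = -1 flips the preference for bidirectional training.
    This sketches the general loss family, not the paper's exact formula."""
    margin = beta * (logp_chosen - logp_rejected)
    return -math.log(sigmoid(d * margin))

# A larger chosen-vs-rejected margin yields a smaller loss:
assert preference_loss(-1.0, -5.0) < preference_loss(-1.0, -1.5)
```

With d sampled from {−1, +1} per batch, the same v learns to steer behavior in both directions, which is the bidirectional aspect inherited from BiPO.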
4. Training Procedure and Computational Analysis
The training algorithm operates as follows:
```
for epoch in 1..N:
    sample minibatch {(x_i, y_i^w, y_i^l)}_{i=1}^B
    sample direction d ∈ {−1, +1}
    for each i in minibatch:
        h_i  = A_L(x_i)             # hidden state at layer L
        s_i  = Enc(h_i)             # SAE code
        s̃_i  = ReLU(s_i + d·v)      # steered sparse code
        ĥ_i  = Dec(s̃_i)             # steered reconstruction
        h̄_i  = Dec(Enc(h_i))        # plain reconstruction
        h'_i = ĥ_i + (h_i − h̄_i)    # residual correction
    compute batch BiPO-style loss on {h'_i}
    update v using AdamW optimizer
```
Empirically, YaPO reaches a low preference-loss plateau in fewer than 150 steps, while BiPO typically requires over 600 steps to reach a higher loss plateau. The sparse steering direction's disentanglement properties yield smoother convergence and improved stability. Memory costs are dominated by the trainable code v (dimension 65,000 for 2B models; 131,000 for 9B), with minimal additional overhead from the SAE. Training times are approximately 10 minutes (2B model on 8 × AMD MI210 GPUs) and 30 minutes (9B); the main compute bottlenecks are two LLM forward passes and a single small SAE forward/backward pass per batch (Bounhar et al., 13 Jan 2026).
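The claim that steering adds minimal memory can be checked with simple arithmetic (fp32 parameters assumed):

```python
def steering_vector_bytes(code_size, bytes_per_param=4):
    """Memory for the trainable vector v alone, assuming fp32 parameters.
    AdamW adds two moment buffers of the same size, roughly tripling this."""
    return code_size * bytes_per_param

# Even at the 9B SAE width, v is only about half a megabyte:
assert steering_vector_bytes(131_000) == 524_000
assert steering_vector_bytes(65_000) == 260_000
```

This is negligible next to the frozen LLM and SAE weights, consistent with the reported compute profile.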
5. Experimental Protocol and Results
Datasets:
- The main benchmark covers 5 languages, 15 country variants, and approximately 45,000 paired prompts contrasting localized and non-localized contexts.
- 52 topics spanning meals, etiquette, family, routines, and ceremonies are included.
Metrics:
- MCQ accuracy (%)
- Open-ended judged score (0–10)
- RCA (Robust Cultural Accuracy): harmonic mean of localized and non-localized performance
- PNLG (Performance-Normalized Localization Gap): normalized gap between settings
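The RCA definition (harmonic mean) is given in the source; the exact PNLG normalization is not, so the second function below assumes gap-over-mean and should be read as a sketch:

```python
def rca(localized, non_localized):
    """Robust Cultural Accuracy: harmonic mean of accuracy in the
    localized and non-localized settings (per the source's definition)."""
    total = localized + non_localized
    return 2 * localized * non_localized / total if total else 0.0

def pnlg(localized, non_localized):
    """Performance-Normalized Localization Gap -- exact normalization not
    given here; this assumes absolute gap divided by mean performance."""
    mean = (localized + non_localized) / 2
    return abs(localized - non_localized) / mean if mean else 0.0

# Harmonic mean penalizes imbalance: equal settings score their common value,
# while a large gap drags RCA below the arithmetic mean.
assert rca(0.5, 0.5) == 0.5
assert rca(0.8, 0.2) < 0.5
assert pnlg(0.4, 0.4) == 0.0
```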
Baselines:
- No steering (LLM baseline)
- CAA (Contrastive Activation Addition; dense, static)
- SAS (Sparse Activation Averaging; SAE basis, static)
- BiPO (dense preference optimization)
Summary of Key Results:
| Metric | Baseline | YaPO | Gain |
|---|---|---|---|
| MCQ Accuracy (%) | 26.7 | 41.2 | +14.5 pp |
| Open-ended Score | 1.59 | 2.22 | +0.63 |
| PNLG MCQ (↓) | 0.253 | 0.184 | -0.069 |
| PNLG OG (↓) | 1.466 | 1.308 | -0.158 |
| MMLU Score | ~57.6 | ~57.6 | – |
YaPO demonstrates rapid convergence (≲150 steps), robustness to the choice of steering multiplier, and matches or surpasses baselines on hallucination, wealth-seeking, jailbreak, and power-seeking tasks. General knowledge retention, measured by MMLU, shows no measurable degradation under steering (Bounhar et al., 13 Jan 2026).
6. Broader Applications and Current Limitations
Applications documented include:
- Fine-grained cultural adaptation
- Hallucination suppression
- Wealth-seeking moderation
- Jailbreak resistance
- Power-seeking attenuation
YaPO preserves general knowledge across tasks, indicating alignment interventions do not degrade baseline capabilities. Identified limitations are the dependence on a suitable pretrained SAE (Gemma-Scope for Gemma-2 family in current experiments), the cultural dataset’s focus on between-country rather than within-country distinctions, and untested generalization to architectures beyond Gemma-2 absent a compatible SAE.
7. Practical Deployment and Integration Guidelines
Hyperparameters (2B model):
- Optimizer: AdamW with weight decay 0.05 (learning rate and momentum coefficients as reported in the paper)
- Batch size/GPU: 4, no gradient accumulation
- Epochs: 20 (150 steps to convergence)
- β (DPO temperature): 1.0 (default)
- Steering multiplier: 1.0; robust across a range of values
- SAE insertion layer: 15 (2B), 28 (9B)
- SAE code size: 65,000 (2B), 131,000 (9B)
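The hyperparameters above can be collected into a single configuration object (field names are illustrative, not from the reference implementation; values elided in the source are omitted):

```python
from dataclasses import dataclass

@dataclass
class YaPOConfig:
    """Deployment hyperparameters (2B values; 9B variants in comments).
    Field names are illustrative, not from the reference implementation."""
    batch_size_per_gpu: int = 4
    epochs: int = 20
    beta: float = 1.0            # DPO temperature
    steering_multiplier: float = 1.0
    sae_layer: int = 15          # 28 for the 9B model
    sae_code_size: int = 65_000  # 131,000 for the 9B model
    weight_decay: float = 0.05   # AdamW

cfg = YaPOConfig()
assert cfg.sae_layer == 15 and cfg.epochs == 20
```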
Inference and integration practices:
- Positive steering at inference: apply the intervention with d = +1, i.e., h' = Dec(ReLU(Enc(h) + v)) + (h − Dec(Enc(h))).
- Use residual correction to ensure identity behavior when v = 0.
- Tuning of the steering multiplier is recommended only if over-steering is observed; YaPO is less sensitive than CAA or SAS.
- Layer selection by activation patching: select the SAE/LLM layer yielding maximal difference between localized vs. non-localized patching.
- Code and replication resources are available at https://github.com/MBZUAI-Paris/YaPO.
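The layer-selection heuristic above reduces to an argmax over per-layer effect sizes; in this sketch, `patched_diff` is a hypothetical precomputed mapping from layer index to the measured localized-vs-non-localized patching difference:

```python
def select_sae_layer(patched_diff):
    """Return the layer with the largest activation-patching effect.
    patched_diff maps layer index -> localized-vs-non-localized difference,
    assumed precomputed by running patching at each candidate layer."""
    return max(patched_diff, key=patched_diff.get)

# Toy per-layer effect sizes; layer 15 dominates, matching the 2B default:
assert select_sae_layer({14: 0.12, 15: 0.41, 16: 0.20}) == 15
```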
YaPO constitutes a generalizable, interpretable, and computationally efficient mechanism for domain adaptation and fine-grained alignment of LLMs via sparse activation interventions, exhibiting superior empirical properties to dense steering methods on both performance and stability metrics (Bounhar et al., 13 Jan 2026).