CLaS-Bench: Cross-Lingual Steering Benchmark
- CLaS-Bench is a benchmark evaluating multilingual steering in large language models by testing cross-lingual control and semantic retention.
- The framework uses 70 parallel prompts covering 32 languages to assess methods ranging from simple prompting to unsupervised representation-based interventions such as DiffMean and LAPE.
- Key metrics include Language Forcing Success, Output Relevance, and the Harmonic-Mean Steering Score, which together highlight performance differences among steering techniques.
CLaS-Bench (“Cross-Lingual Alignment and Steering Benchmark”) is a standardized evaluation framework designed to quantify and analyze the ability of LLMs to be steered into producing output in a target language while preserving semantic fidelity. It addresses the absence of dedicated benchmarks for representation-based multilingual steering techniques, enabling systematic scientific analysis of language control methods, mechanistic insights into internal model representations, and rigorous comparison across 32 typologically diverse languages (Gurgurov et al., 13 Jan 2026).
1. Benchmark Design and Language Coverage
CLaS-Bench is constructed from 70 open-ended prompt questions selected from the Vicuna dataset, covering reasoning, factual knowledge, creative writing, opinions, and professional writing. These prompts are translated into 31 additional languages—including high-resource (Spanish, Chinese, German), mid-resource (Czech, Thai), and low-resource or non-Latin-script languages (Tibetan, Georgian)—using machine translation followed by manual native speaker proofreading. This results in an explicit inventory of 32 languages identified by ISO code, linguistic family, script, and resource level. Each prompt-target-language combination forms a unique test instance, yielding $2{,}240$ prompt instances ($70 \times 32$) and a total of $71{,}680$ cross-language steering trials.
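The benchmark's size follows directly from its cross-product construction; a quick arithmetic check, assuming every prompt exists in every language and each instance is steered toward every target language:

```python
# Back-of-the-envelope check of the benchmark size, assuming the
# construction described above: 70 prompts, each available in all
# 32 languages, each instance steered toward each of 32 targets.
NUM_PROMPTS = 70
NUM_LANGUAGES = 32

instances = NUM_PROMPTS * NUM_LANGUAGES  # unique prompt-language instances
trials = instances * NUM_LANGUAGES       # prompt-language-target triples

print(instances)  # 2240
print(trials)     # 71680
```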
| Language | ISO Code | Family | Resource Level (1 = low) |
|---|---|---|---|
| Tibetan | bo | Sino-Tibetan | 1 |
| Maltese | mt | Afro-Asiatic | 2 |
| Romanian | ro | Indo-European | 3 |
| ... | ... | ... | ... |
CLaS-Bench’s parallel-question construction avoids confounding semantic drift, supporting robust controlled comparisons of model steering efficacy across similarly structured inputs.
2. Evaluation Protocol and Metrics
Steering success in CLaS-Bench is measured along two orthogonal axes:
- Language Control (Language Forcing Success, LFS): Detects whether the output was generated in the desired target language using a FastText-based classifier:

$$\mathrm{LFS} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\operatorname{lang}(y_i) = \ell_{\text{tgt}}\right]$$
- Semantic Relevance (Output Relevance, OR): Uses Qwen-3-8B to judge output relevance, ignoring output language, on a 0–2 scale (0: unrelated/gibberish, 1: partially relevant, 2: clearly relevant). Scores are normalized to $[0, 1]$:

$$\mathrm{OR} = \frac{1}{N}\sum_{i=1}^{N}\frac{s_i}{2}, \qquad s_i \in \{0, 1, 2\}$$
Combining these metrics, CLaS-Bench defines the Harmonic-Mean Steering Score (LSS):

$$\mathrm{LSS} = \frac{2 \cdot \mathrm{LFS} \cdot \mathrm{OR}}{\mathrm{LFS} + \mathrm{OR}}$$
LSS penalizes methods that optimize one axis at the expense of the other, ensuring evaluation of both language-forcing and semantic preservation.
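A minimal sketch of this scoring pipeline. The language detector and judge are abstracted away (in the benchmark they are a FastText classifier and Qwen-3-8B, respectively); the function names below are illustrative, not from the benchmark's codebase:

```python
def language_forcing_success(pred_langs, target_lang):
    """LFS: fraction of outputs whose detected language is the target."""
    return sum(l == target_lang for l in pred_langs) / len(pred_langs)

def output_relevance(raw_scores):
    """OR: judge scores on a 0-2 scale, normalized to [0, 1]."""
    return sum(s / 2 for s in raw_scores) / len(raw_scores)

def steering_score(lfs, or_):
    """LSS: harmonic mean of LFS and OR (0 if either axis is 0)."""
    if lfs + or_ == 0:
        return 0.0
    return 2 * lfs * or_ / (lfs + or_)

# Example: 8/10 outputs in the target language, mean judge score 1.6/2.
lfs = language_forcing_success(["de"] * 8 + ["en"] * 2, "de")  # 0.8
or_ = output_relevance([2, 2, 2, 1, 2, 1, 2, 2, 1, 1])         # 0.8
print(round(steering_score(lfs, or_), 3))  # 0.8
```

Because the harmonic mean collapses to zero when either axis is zero, a method that forces the target language while producing gibberish (LFS high, OR near 0) scores poorly, exactly as intended.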
3. Steering Techniques
Eight steering approaches are evaluated, distinguished as either prompting baselines or internal representation-based interventions:
- Prompting Baselines
- Baseline-I: Appends “Respond in [Target Language]” in English.
- Baseline-II: Same instruction in the target language.
- Representation-Based Interventions
- LAPE (Language Activation Probability Entropy): Selects highly language-selective MLP neurons and executes additive or replacement interventions based on average activation in the target language.
- DiffMean: Unsupervised difference of means in the residual stream, modifying activations in the direction of average language shift.
- Probes: Trained binary linear probes to classify language, using the probe weight as intervention direction.
- PCA: Projects onto the top-$k$ principal components of target-language residuals and reconstructs an intervention direction.
- LDA: Applies binary LDA to residuals between target and other languages, intervening along the resulting discriminative direction.
- SAE-DiffMean: Utilizes sparse autoencoder (JumpReLU) latent directions for difference-of-means interventions at the feature level, followed by decoding.
All hidden-state methods are parameterized by a strength hyperparameter $\alpha$, explicitly controlling the magnitude of intervention at layer $\ell$.
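The DiffMean intervention above can be sketched in a few lines. This is a toy illustration of the technique, not the authors' implementation: the direction is the (unit-normalized) difference of mean residual-stream activations between target- and source-language text, added to the residual stream with strength $\alpha$:

```python
import numpy as np

def diffmean_direction(target_acts, source_acts):
    """target_acts, source_acts: (n_tokens, d_model) residual-stream
    activations collected at one layer for each language."""
    d = target_acts.mean(axis=0) - source_acts.mean(axis=0)
    return d / np.linalg.norm(d)  # unit-norm steering direction

def steer(resid, direction, alpha):
    """Additively shift residual-stream states along the direction."""
    return resid + alpha * direction

rng = np.random.default_rng(0)
tgt = rng.normal(1.0, 1.0, size=(1000, 64))  # toy "target language" residuals
src = rng.normal(0.0, 1.0, size=(1000, 64))  # toy "source language" residuals
d = diffmean_direction(tgt, src)
steered = steer(src, d, alpha=5.0)           # alpha in the 5-10 range (mid-late layers)
print(steered.shape)  # (1000, 64)
```

In practice the shift is applied to the model's hidden states during generation at the chosen layer; the toy arrays here stand in for activations extracted from the model.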
4. Comparative Analysis and Results
Evaluation on Llama-3.1-8B across the full language set demonstrates clear stratification of method performance:
| Method | Avg. LSS (%) |
|---|---|
| Baseline-I | 67.7 |
| Baseline-II | 67.3 |
| DiffMean | 84.5 |
| Probes | 48.6 |
| PCA | 15.1 |
| SAE-DM | 42.3 |
| LDA | 23.6 |
| LAPE | 80.1 |
DiffMean achieves the highest mean LSS (84.5%), exceeding 90% in 19/32 languages. LAPE follows at 80.1%. Prompting baselines plateau at ~67% and sometimes fail outright, defaulting to English output. Supervised methods (Probes, LDA) yield lower scores than unsupervised DiffMean and LAPE, implying that simple direction-based interventions generalize better than classifier-derived ones. SAE-based methods lag, with performance potentially limited by reconstruction error or feature coverage.
Layer-wise analyses reveal steering is most effective in later layers (ℓ≈16–32), with language-specific structure emerging predominantly at depth. Cosine similarity between DiffMean directions across languages decreases toward later layers, indicating increasingly distinct, language-specific directions. Linear probes achieve >99% accuracy by layer 14. LAPE neuron clusters and LDA's Fisher ratio also peak in deep layers. Clustering analysis of embedding geometry uncovers distinct language family associations (Romance, Germanic, Slavic, etc.) in residual space.
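The cosine-similarity analysis above can be sketched as follows. Given one steering direction per language at a layer, the mean pairwise cosine similarity measures how entangled the directions are; the toy "early" and "late" layers below are synthetic stand-ins constructed so that early-layer directions share a dominant common component:

```python
import numpy as np

def mean_pairwise_cosine(directions):
    """directions: (n_languages, d_model) steering directions at one layer.
    Returns the mean off-diagonal cosine similarity."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(directions)
    return sims[~np.eye(n, dtype=bool)].mean()

rng = np.random.default_rng(1)
shared = rng.normal(size=64)
# Toy "early layer": directions dominated by a shared component.
early = shared + 0.3 * rng.normal(size=(8, 64))
# Toy "late layer": mostly language-specific variation.
late = 0.2 * shared + rng.normal(size=(8, 64))
print(mean_pairwise_cosine(early) > mean_pairwise_cosine(late))  # True
```

Lower similarity in later layers corresponds to the benchmark's finding that language-specific structure (and hence steerable separability) emerges at depth.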
5. Empirical Insights and Usage Guidelines
DiffMean’s outperformance is attributed to its unsupervised nature and representational generality, reliably identifying the primary direction of language shift in the residual stream. For layer-wise steering, recommended settings are:
- Early layers (small $\ell$): use low strength $\alpha$; overstrength degrades coherence.
- Mid-late layers ($\ell \approx 16$–$32$): $\alpha$ up to 5–10 is effective.
DiffMean’s minimal resource requirements (~10M tokens per language, no labels) make it suitable for low-resource scenarios. For robust cross-family switching (e.g., Japanese→Arabic), combining DiffMean with LAPE neuron-level interventions enhances output stability. A plausible implication is that pairing direction-based and neuron-specific strategies may mitigate semantic loss across distant languages.
6. Limitations and Directions for Further Research
Methodological differences in token counts (DiffMean/LAPE: 10M, PCA: 0.5M, LDA: 0.1M) introduce variable data-scale effects; future work should equalize dataset size to isolate method efficacy. SAE availability was limited to select layers and models; expansion to other architectures (such as Aya-Expanse) would enrich comparative insights. Language coverage, while broad (32 languages), omits many low-resource and typologically diverse languages. Current experiments focus on instruction-tuned 8B models; assessment in base and very large models (>100B parameters) may reveal new patterns of language control. Mechanistic follow-ups using CLaS-Bench could localize the circuits or attention mechanisms responsible for multilingual steering.
7. Conclusion and Impact
CLaS-Bench is the first systematic, parallel-question benchmark for evaluating multilingual steering techniques. It enables the precise measurement of LLMs’ capacity for cross-lingual control and semantic preservation via both prompting and internal representation manipulation. Key findings demonstrate that unsupervised residual-based steering (DiffMean) is a powerful, cost-effective approach across language families and resource levels, outperforming supervised and neuron-specific methods. The benchmark advances both practical adaptation strategies and theoretical understanding of where and how language-specific signals manifest within large models (Gurgurov et al., 13 Jan 2026).