Cross-Lingual Unlearning Techniques
- Cross-lingual unlearning is a process that purposefully removes specific language-bound knowledge from multilingual models while maintaining non-targeted capabilities.
- It employs gradient-based, subspace projection, and adaptive optimization techniques to selectively erase sensitive or unwanted information.
- Empirical evaluations use metrics like Forgetting Quality and utility scores to balance removed content with retained performance across different languages.
Cross-lingual unlearning refers to the process of purposefully removing, erasing, or otherwise diminishing specific learned knowledge, skills, or behaviors in one or more languages of a multilingual LLM, while minimizing collateral damage to retained multilingual capabilities. The problem is central to regulatory compliance, safety, privacy (e.g., removal of sensitive or copyrighted data), and behavioral control of LLMs operating across many linguistic domains. Unlike monolingual unlearning, cross-lingual unlearning must contend with the intricate parameter sharing and latent subspace geometry that underpin modern multilingual models, where facts, skills, or behaviors may reside in weights common to multiple languages.
1. Definitions, Phenomena, and Foundational Concepts
Cross-lingual unlearning targets the elimination of information learned in one language and examines the degree to which this forgetting generalizes or transfers across languages the model supports. A primary challenge is that multilingual models encode knowledge both in shared “interlingua” parameter subspaces and language-specific representations. As a result, standard unlearning methods applied to a single language often fail to induce forgetting elsewhere, especially when semantic and tokenization factors are not explicitly accounted for (Lizzo et al., 10 Jan 2026, Choi et al., 2024).
Catastrophic forgetting, a related concept from continual learning, is formalized as a loss in performance on an original source-language task after adaptation to new target languages (Koloski et al., 2023). In the more operational paradigm of machine unlearning, the objective is to produce updated model parameters θ′ such that, with respect to a designated forget set D_f and retain set D_r (partitioned by language), the targeted outputs on D_f are no longer reliably produced, while performance on D_r is preserved (Choi et al., 2024, Farashah et al., 9 Jan 2026).
2. Methodological Approaches
2.1 Gradient-Based and Objective-Driven Algorithms
Existing methods include:
- Gradient Ascent (“maximum loss”): increases loss on the forget set, but generally fails to yield cross-lingual unlearning and often collapses overall utility (Lizzo et al., 10 Jan 2026, Choi et al., 2024).
- Gradient-Difference (GradDiff) and Variants: ascend the loss (i.e., minimize the negative cross-entropy) on forget data while descending it on retain data, optionally with KL-regularization anchoring the model distribution near the original on the retain set (Farashah et al., 9 Jan 2026).
- Negative Preference Optimization (NPO): assigns negative preference to the targeted output in the forget set relative to a reference model (Farashah et al., 9 Jan 2026).
- FLAT (F-divergence Loss Adjustment): introduces explicit divergence constraints, though cross-lingual effects remain negligible (Lizzo et al., 10 Jan 2026).
These approaches are susceptible to "language escape" phenomena, whereby the model avoids generating in the forbidden language but can still output the banned content in a different language or script (Hwang et al., 28 Oct 2025).
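The GradDiff-style objective above can be sketched in a few lines. This is a minimal NumPy illustration for a single token position; the anchoring weight `beta` and the exact form of the KL term are assumptions, and real implementations operate on batched sequence logits.

```python
import numpy as np

def cross_entropy(logits, target):
    # Softmax cross-entropy for a single token position.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def kl_divergence(p_logits, q_logits):
    # KL(softmax(p) || softmax(q)); anchors the unlearned model to the
    # original (reference) distribution on retain data.
    def log_softmax(z):
        zz = z - z.max()
        return zz - np.log(np.exp(zz).sum())
    p = np.exp(log_softmax(p_logits))
    return float((p * (log_softmax(p_logits) - log_softmax(q_logits))).sum())

def graddiff_loss(forget_logits, forget_target,
                  retain_logits, retain_target,
                  ref_retain_logits, beta=0.1):
    # Ascend on the forget example (negative CE term), descend on the
    # retain example, and keep the retain distribution near the reference.
    return (-cross_entropy(forget_logits, forget_target)
            + cross_entropy(retain_logits, retain_target)
            + beta * kl_divergence(ref_retain_logits, retain_logits))
```

Minimizing this combined loss pushes the forget-set likelihood down while the retain-set terms counteract utility collapse, which is the failure mode of plain gradient ascent noted above.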
2.2 Subspace Projection and Geometric Methods
Subspace-projection approaches (e.g., UNLEARN) identify and remove low-dimensional parameter subspaces associated with the forget task. The key insight is that cross-lingual facts are encoded in shared subspaces: excising these directions through projection causes forgetting across all languages aligned with that subspace, including those not seen in the unlearning update (Lizzo et al., 10 Jan 2026). For language-specific unlearning, orthogonal language-residual projections can cause selective forgetting. Algorithmically, this proceeds by (i) constructing gradient matrices distinguishing forget vs. control tasks per language, (ii) extracting task-specific and language-shared subspaces via SVD/PCA, and (iii) projecting weight updates orthogonally to target dimensions.
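Steps (ii) and (iii) reduce to standard linear algebra. A minimal NumPy sketch, with `forget_subspace` and `project_out` as hypothetical helper names; the subspace rank and whether the projection is applied to the weights themselves or to subsequent updates are implementation choices not fixed here.

```python
import numpy as np

def forget_subspace(grad_matrix, rank):
    # grad_matrix: (n_examples, n_params) gradients collected on the forget
    # task across languages; the top right-singular vectors approximate the
    # shared directions that encode the targeted fact.
    _, _, vt = np.linalg.svd(grad_matrix, full_matrices=False)
    return vt[:rank]  # (rank, n_params), rows orthonormal

def project_out(vector, basis):
    # Remove the component of a parameter vector (or weight update) that
    # lies in the span of the extracted forget directions, leaving
    # orthogonal (retained) directions untouched.
    return vector - basis.T @ (basis @ vector)
```

Because the excised directions are shared across languages, a single projection can induce forgetting even in languages absent from the unlearning update.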
2.3 Adaptive and Language-Weighted Optimization Frameworks
Adaptive unlearning combines per-language losses with tunable weights w_ℓ and a global trade-off parameter λ, e.g.

L(θ) = Σ_ℓ w_ℓ [ (1 − λ) · L_retain^(ℓ)(θ) − λ · L_forget^(ℓ)(θ) ]

This enables practitioners to up-weight low-resource languages (which are particularly vulnerable if ignored) and to interpolate between aggressive forgetting (λ → 1) and full retention (λ → 0) (Choi et al., 2024).
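A minimal sketch of such a language-weighted objective, assuming scalar per-language forget and retain losses have already been computed; the exact functional form varies by implementation and is an assumption here.

```python
def adaptive_unlearning_loss(forget_losses, retain_losses, weights, lam):
    # forget_losses / retain_losses: per-language scalar loss values.
    # weights: per-language weights w_l (up-weight low-resource languages).
    # lam: global trade-off in [0, 1]; lam -> 1 forgets aggressively,
    #      lam -> 0 preserves retention.
    total = 0.0
    for w, lf, lr in zip(weights, forget_losses, retain_losses):
        total += w * ((1.0 - lam) * lr - lam * lf)
    return total
```

Minimizing the total drives retain losses down and forget losses up, with the per-language weights controlling how much each language contributes to both pressures.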
2.4 Training-Free Interventions
Training-free inference-time techniques (e.g., Neuron Adjust, Key Space Detection) leverage the geometry of neuron activations to distinguish queries corresponding to the skill/language to be suppressed. Intervening on neuron distributions (shifting pre-activations) or abstaining on queries whose activations fall into pre-identified “forbidden” regions enables high-precision, language-targeted skill unlearning while minimally impacting other skills (Li et al., 27 Mar 2025).
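The abstention variant can be approximated with simple activation geometry. A toy NumPy sketch: the "forbidden" region is modeled as a centroid-plus-radius hypersphere, which is a deliberate simplification of the key-space regions described above, and both function names are hypothetical.

```python
import numpy as np

def fit_forbidden_region(activations, n_std=3.0):
    # activations: (n, d) hidden activations collected on queries that
    # exercise the skill/language to be suppressed. Summarize them as a
    # centroid plus a radius covering typical activation distances.
    center = activations.mean(axis=0)
    dists = np.linalg.norm(activations - center, axis=1)
    radius = dists.mean() + n_std * dists.std()
    return center, radius

def should_abstain(query_activation, center, radius):
    # Abstain (or intervene on neuron distributions) when the query's
    # activation falls inside the pre-identified forbidden region.
    return np.linalg.norm(query_activation - center) <= radius
```

Because no parameters are updated, queries outside the region are served unchanged, which is what keeps collateral damage to other skills low.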
3. Metrics and Evaluation Paradigms
Assessment of cross-lingual unlearning relies on metrics capturing both the completeness of forgetting and the preservation of unrelated capabilities:
- Forgetting Quality (FQ): statistical tests (e.g., two-sample p-values) over output distributions between original and unlearned models on forget sets (Lizzo et al., 10 Jan 2026).
- Memorization/Probing Accuracy: top-1 cloze or token recall on forget and retain sets, reported per language (Choi et al., 2024).
- Change in macro-averaged accuracy or truth ratio: measured separately for source vs. target languages (Koloski et al., 2023, Farashah et al., 9 Jan 2026).
- General Model Utility (U): harmonic mean of downstream-task scores, normalized such that U = 1 indicates zero degradation (Lizzo et al., 10 Jan 2026).
- N-Mix Score: quantifies “language confusion” by measuring the proportion of n-gram fragments in responses that differ from the query language (Hwang et al., 28 Oct 2025).
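An N-Mix-style score can be illustrated with a crude Unicode-script proxy for language identity. This is a sketch only: the actual metric relies on proper language identification over n-gram fragments, whereas this toy version distinguishes only Latin vs. Cyrillic script and assumes pre-tokenized input.

```python
import unicodedata

def script_of(token):
    # Crude script detector via Unicode character names: Latin vs.
    # Cyrillic vs. everything else (digits, punctuation, other scripts).
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                return "latin"
            if name.startswith("CYRILLIC"):
                return "cyrillic"
    return "other"

def n_mix_score(response_tokens, query_script, n=2):
    # Fraction of n-gram fragments in the response containing material
    # from a script other than the query's (a stand-in for a different
    # language), capturing "language confusion" in the output.
    grams = [response_tokens[i:i + n]
             for i in range(len(response_tokens) - n + 1)]
    if not grams:
        return 0.0
    mixed = sum(
        1 for g in grams
        if any(script_of(t) not in (query_script, "other") for t in g)
    )
    return mixed / len(grams)
```

A high score on an English query whose response drifts into another script is exactly the "language escape" symptom the metric is designed to surface.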
Critical flaws in reference-based metrics have been exposed: after unlearning, models may “hide” retained knowledge in an alternative language, causing surface-form metrics to report spurious forgetting. Semantic-based, language-agnostic evaluation protocols (using LLM judges to compare the semantic content of generated and reference answers) are now recommended to robustly measure true forgetting and retention (Hwang et al., 28 Oct 2025).
4. Empirical Findings and Language Transfer Patterns
A consistent finding is that most naive unlearning methods provide little or no cross-lingual transfer: when the forget set is supplied in a single language, the model typically remains capable of producing the forbidden fact or behavior in other languages, especially those that are low-resource or typologically distant (Choi et al., 2024, Lizzo et al., 10 Jan 2026, Farashah et al., 9 Jan 2026). Subspace-projection is uniquely effective for cross-lingual forgetting, particularly along the shared “interlingua” directions, yielding high FQ and minimal utility loss in all tested Latin-script languages and even under script variation (Lizzo et al., 10 Jan 2026).
Empirical studies show that:
- For skill unlearning, techniques such as Key-Space Detection produce >80% performance drops on the targeted language, with <2% drops on others, indicating highly localized, language-specific unlearning (Li et al., 27 Mar 2025).
- Unlearning in high-resource languages is more stable, and asymmetric transfer is observed (e.g., unlearning in Russian transfers more strongly to English than vice versa) (Farashah et al., 9 Jan 2026).
- Syntactic similarity—quantified using typological features—is the best predictor of cross-lingual unlearning transfer: unlearning spills over between syntactically similar languages, but phonological inventory and script similarity have weaker effects (Farashah et al., 9 Jan 2026).
A notable risk is that English-only unlearning induces language confusion, with the model responding in a different language rather than forgetting the forbidden fact. Only multilingual, cross-lingual forget sets and unlearning objectives can robustly prevent “language escape” (Hwang et al., 28 Oct 2025, Choi et al., 2024).
5. Practical Guidelines, Implementation, and Trade-offs
Best practices for cross-lingual unlearning are dictated by the specific application requirements:
- If best target-language accuracy is paramount: For sequential cross-lingual transfer, full-parameter fine-tuning (Intermediate Training) maximizes target performance, but with some risk of source-language forgetting (Koloski et al., 2023).
- If minimizing collateral forgetting is critical: Use Cross-Lingual Validation, especially in combination with adapters or per-language validation signals, and adopt subspace-projection or training-free abstention for precision (Koloski et al., 2023, Li et al., 27 Mar 2025).
- For maximum robustness against leakage: Apply unlearning objectives across all languages (joint or many-to-one setups), using language-adaptive weighting to avoid under-forgetting in low-resource domains (Choi et al., 2024, Hwang et al., 28 Oct 2025).
- Script and tokenization mismatches: Utilize transliteration and script-agnostic preprocessing to align the action of subspace-based methods and ensure forgetting propagates under script divergence (Lizzo et al., 10 Jan 2026).
Key hyperparameters include the trade-off parameter λ between forgetting and retention, the per-language weight vector w, and, in geometric methods, the subspace rank or abstention-region parameters (Choi et al., 2024, Li et al., 27 Mar 2025).
6. Theoretical Insights and Implications for Model Design
The geometry of weight space in multilingual LLMs strongly conditions cross-lingual unlearning dynamics. “Interlingua” subspaces encode language-agnostic semantics, with language-specific residuals capturing orthogonalized information. Effective cross-lingual unlearning thus hinges on identifying and, if desired, abrogating shared subspaces to ensure consistent forgetting across languages and scripts (Lizzo et al., 10 Jan 2026, Farashah et al., 9 Jan 2026).
Future model designs may incorporate explicit subspace partitioning or parameter modularization to facilitate rapid, selective unlearning with fine-grained linguistic control. Likewise, resource-aware optimization—especially for models with vast linguistic coverage and data imbalance—will be key to preventing memory of sensitive information in underrepresented languages (Choi et al., 2024, Farashah et al., 9 Jan 2026). Evaluation standards are expected to shift toward semantic, reference-free metrics robust to language confounding and language-mix phenomena (Hwang et al., 28 Oct 2025).