Translated TOFU Benchmarks Overview
- Translated TOFU benchmarks are systematic test sets adapted from monolingual or unimodal datasets to evaluate model generalization, cross-lingual transfer, and targeted forgetting.
- They utilize a blend of automated translation, manual curation, token alignment, and filtering techniques to preserve semantic fidelity and cultural context.
- Evaluation metrics like accuracy, pass@1, and cross-lingual forgetting scores drive insights for improving multilingual, code, and multimodal model performance.
Translated TOFU Benchmarks refer to benchmarks that have been systematically adapted from their original monolingual or unimodal form—often in English or Python—into multiple languages, programming languages, or modalities (such as visual input), in order to facilitate comprehensive evaluation of model generalization, cross-lingual transfer, and practical robustness. These benchmarks are central to the assessment of multilingual models, code LLMs (CLMs), multimodal systems, and privacy-preserving learning protocols. They comprise both straightforward automatic translations and meticulously validated human-curated sets, often employing advanced filtering or harmonization to safeguard fidelity. Notably, several research efforts employ the TOFU term (“Translated-Only, Fully-Utilized” or “Task of Fictitious Unlearning”) for benchmarks purpose-built to probe transferability, privacy, and targeted forgetting across languages or modalities (Qiu et al., 2022, Dandamudi et al., 2024, Plaza et al., 2024, Lizzo et al., 10 Jan 2026, Pippi et al., 6 Mar 2025).
1. Benchmark Translation Methodologies
Translated TOFU benchmarks employ a range of approaches:
- Automated Translators: Machine translation engines (e.g., Azure Translator, M2M-100-large) or transpilation tools (PyJs, Py2Java) enable large-scale translation of natural language text, programming problems, or code. Automated pipelines excel in scalability but are prone to semantic drift, idiomatic errors, or poorly supported domain constructs (Dandamudi et al., 2024, Qiu et al., 2022, Plaza et al., 2024).
- Manual Curation: Human experts or dedicated translators rewrite prompts, problems, or test cases case-by-case. This approach maximizes semantic fidelity and cultural/idiomatic adaptation but is costly and does not scale to many languages (Dandamudi et al., 2024, Plaza et al., 2024).
- Script and Tokenization Alignment: Many benchmarks explicitly produce parallel datasets in both standard and romanized scripts (e.g., Chinese Han vs. Pinyin, Hindi Devanagari vs. ISO Roman) to control for tokenization effects in model architectures (Lizzo et al., 10 Jan 2026).
- Filtering and Validation: Automated metrics—Type-Token Ratio (TTR) complement, BLEU copying score—are used to discard low-quality or substantially untranslated items. Thresholds are tuned per language group to maintain corpus integrity (Qiu et al., 2022).
- Cultural and Contextual Adaptation: Post-processing steps include mapping named entities, idioms, and cultural references into locally appropriate equivalents, guided by bilingual experts (Plaza et al., 2024).
A typical pipeline involves machine translation of source benchmarks, error-driven human validation (focusing on “correct in source, incorrect in translation” items), heuristic filtering, and back-translation consistency checks.
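The filtering step in such a pipeline can be sketched as follows. This is an illustrative implementation, not code from any cited paper: the function names and the thresholds (which the papers tune per language group) are our own, and the two "badness" scores follow the standard definitions of the TTR complement and an n-gram copying score.

```python
def ttr_complement(text: str) -> float:
    """1 - (unique tokens / total tokens); high values signal repetitive output."""
    tokens = text.split()
    if not tokens:
        return 1.0
    return 1.0 - len(set(tokens)) / len(tokens)

def copying_score(source: str, translation: str, n: int = 2) -> float:
    """Fraction of translation n-grams that also appear in the source;
    high values suggest the text was left largely untranslated."""
    def ngrams(s: str) -> set:
        toks = s.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    trans_ngrams = ngrams(translation)
    if not trans_ngrams:
        return 1.0
    return len(trans_ngrams & ngrams(source)) / len(trans_ngrams)

def keep(source: str, translation: str,
         max_ttr_comp: float = 0.6, max_copy: float = 0.5) -> bool:
    """Keep a pair only if it passes both badness thresholds
    (illustrative values; real thresholds are tuned per language group)."""
    return (ttr_complement(translation) <= max_ttr_comp
            and copying_score(source, translation) <= max_copy)
```

A pair whose "translation" is a verbatim copy of the source fails the copying check, while a genuinely translated sentence with modest repetition passes both.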
2. Dataset Characteristics and Coverage
Translated TOFU benchmarks span a variety of underlying domains:
- Multilingual Natural Language Benchmarks: MMLU subsets (Miscellaneous, Philosophy, Policy), HumanEval code tasks, vision-language corpora (Conceptual Captions, GQA, NLVR2) are translated into Spanish, French, Hindi, Chinese, Russian, and more (Plaza et al., 2024, Qiu et al., 2022, Dandamudi et al., 2024).
- Programming Language Benchmarks: HumanEval translated via MultiPL-E (automated AST-based transpilation) and HumanEvalSynthesize (manual translation) into JavaScript, Java, Rust, among others (Dandamudi et al., 2024).
- Multimodal Evaluations: Image-caption datasets, VQA, image-pair reasoning, retrieval tasks in up to 20 languages form the basis for assessment of multilingual multimodal models (Qiu et al., 2022, Pippi et al., 6 Mar 2025).
- Cross-Lingual Unlearning Probes: TOFU benchmarks are instantiated in multiple languages/scripts for direct comparison of targeted erasure and retention (Lizzo et al., 10 Jan 2026).
- Token Reduction and Multi-image Inputs: Benchmarks such as LLaVA-Interleave and ComPairs challenge LMMs with large multi-image contexts and token-rich inputs, evaluating the efficacy of visual token fusion strategies (Pippi et al., 6 Mar 2025).
Table 1: Languages and Benchmark Types in Recent Translated TOFU Efforts
| Paper/Project | Languages/Scripts | Data Type / Domain |
|---|---|---|
| Spanish MMLU (Plaza et al., 2024) | Spanish (Azure, GPT-4), English | QA & reasoning |
| MultiPL-E (Dandamudi et al., 2024) | Python, JavaScript, Java, Rust | Code generation |
| HumanEvalSynthesize (Dandamudi et al., 2024) | Python, JavaScript, Java, Rust | Code generation |
| TD-MML (Qiu et al., 2022) | 20 languages (Arabic, Bengali, Mandarin, etc) | Vision-language |
| TOFU Unlearning (Lizzo et al., 10 Jan 2026) | English, Spanish, Italian, Hindi (2 scripts), Chinese (2 scripts) | QA, unlearning |
| ToFu (visual) (Pippi et al., 6 Mar 2025) | N/A (visual token inputs) | Multimodal reasoning |
3. Evaluation Metrics and Formulas
Evaluation protocols account for both surface-level and end-to-end functional fidelity:
- Correctness and Accuracy: Simple accuracy (Acc_EN, Acc_ES), absolute drop (ΔAcc), and relative drop (RelDrop) compare model performance across source and translated benchmarks (Plaza et al., 2024).
- Pass@1 and Perplexity: Code benchmarks quantify functional correctness via pass@1—the fraction of unit tests passed by a single model output—and reference perplexity as a training-time fitness measure (Dandamudi et al., 2024).
- Cross-lingual Forgetting Score: For targeted forgetting, statistical tests compare original and unlearned model outputs on the forget set; statistical indistinguishability from a model never trained on that data indicates successful erasure (Lizzo et al., 10 Jan 2026).
- Model Utility: Normalized harmonic mean of downstream task metrics, measuring how much aggregate capability survives unlearning (Lizzo et al., 10 Jan 2026).
- Filtering Scores: Badness metrics for translation filtering (complement of TTR, BLEU copying) control for repetition and untranslated text, with empirically tuned thresholds (Qiu et al., 2022).
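The accuracy-drop, pass@1, and utility metrics above can be written out concretely. The formulas follow their standard definitions; the function and variable names below are ours, not taken from the cited papers.

```python
from statistics import harmonic_mean

def accuracy_drops(acc_src: float, acc_tgt: float) -> tuple:
    """Absolute drop (ΔAcc) and relative drop (RelDrop) between the
    source-language and translated benchmark."""
    delta = acc_src - acc_tgt
    return delta, delta / acc_src

def pass_at_1(results: list) -> float:
    """Fraction of problems whose single sampled solution passes all unit tests."""
    return sum(results) / len(results)

def model_utility(task_scores: list) -> float:
    """Harmonic mean of downstream task metrics; a single low score drags
    the aggregate down, penalizing collateral damage from unlearning."""
    return harmonic_mean(task_scores)
```

The harmonic mean is the natural aggregate here because it rewards models that retain capability on every task rather than averaging away a collapse on one of them.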
4. Empirical Findings and Lessons Learned
Empirical analysis of translated TOFU benchmarks uncovers several trends:
- Translation Fidelity is Critical: In Spanish MMLU, over half the accuracy drop is attributed to translation errors—proper names, technical terms, idioms, semantic shifts, and grammatical issues are the major culprits. Manual corrections can recover 50% of these errors (Plaza et al., 2024).
- Replicability Limitations: In code and QA domains, discrepancies between automated and human translation, test harness adaptation, and coverage gaps yield inconsistent evaluation results. Lack of standardized splits impedes reproducibility (Dandamudi et al., 2024).
- Tokenization and Script Effects: Cross-lingual unlearning shows that English and Latin-script languages transfer forgetting more robustly than non-Latin scripts, due in part to model vocabulary and tokenization—transliteration (Pinyin, Roman Hindi) partially bridges this gap (Lizzo et al., 10 Jan 2026).
- Filtering and Pretraining Quality: Automated filtering of poor translations using TTR and BLEU scores maintains benchmark integrity; in visual-language corpora, machine translation for both pretraining and fine-tuning substantially improves zero-shot and few-shot performance (Qiu et al., 2022).
- Resource Trade-offs and Efficiency: In multi-image LMM benchmarks, visual token fusion drastically reduces memory and runtime requirements (up to 80 GB savings) while preserving or improving answer accuracy (Pippi et al., 6 Mar 2025).
5. Recommendations and Best Practices
The synthesis of empirical work yields consensus on methodology for building translated TOFU benchmarks:
- Expert Review of Flagged Items: Target manual validation at items that diverge post-translation (“correct_EN & wrong_ES”) for maximal efficiency.
- Back-Translation and Heuristics: Employ back-translation and named-entity recognition to flag likely translation mistakes, reducing human workload.
- Cultural/Idiomatic Adaptation: Insert explanatory notes or map entities for region-specific content; direct translation alone is insufficient (Plaza et al., 2024).
- Community-Sourced Revision: Open hosting and crowdsourced correction (e.g. #Somos600M) provide sustainable error mitigation and coverage extension.
- Harmonization of Test Harnesses: Adopting a single, well-documented execution environment ensures consistency; clear provenance for language splits and translation quality is essential (Dandamudi et al., 2024).
- Automatic Quality Filtering: Systematic filtering of translated datasets via TTR and BLEU metrics is recommended for large-scale endeavors (Qiu et al., 2022).
- Balanced Pipeline Automation: Where possible, combine automated translation (scalability) with fallback mechanisms and human-in-the-loop checks.
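The back-translation and entity-preservation heuristics recommended above can be sketched as a single review-flagging check. Everything here is a hedged illustration: `back_translate` is a stand-in for any MT engine, the similarity threshold is arbitrary, and the capitalization-based entity extractor is a crude placeholder for a real NER system.

```python
from difflib import SequenceMatcher

def entities(text: str) -> set:
    """Crude NER stand-in: capitalized, non-sentence-initial tokens."""
    toks = text.split()
    return {t.strip(".,") for t in toks[1:] if t[:1].isupper()}

def needs_review(source: str, translation: str, back_translate,
                 min_sim: float = 0.6) -> bool:
    """Flag an item for expert review if its back-translation diverges
    from the source or if named entities were lost in translation."""
    back = back_translate(translation)
    sim = SequenceMatcher(None, source.lower(), back.lower()).ratio()
    lost_entities = not entities(source) <= entities(translation)
    return sim < min_sim or lost_entities
```

Only flagged items are routed to human experts, which is what makes the error-driven validation strategy affordable at scale.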
6. Impact and Open Challenges
Translated TOFU benchmarks have enabled rigorous evaluation of multilingual, cross-modal, and cross-lingual systems:
- Robust Cross-Lingual Evaluation: Benchmarks spanning multiple languages/scripts reveal nontrivial challenges in generalization, targeted forgetting, and cultural adaptation, informing both modeling and data collection strategies (Lizzo et al., 10 Jan 2026).
- Scalable Multimodal Pretraining: Machine-translated multimodal corpora (TD-MML) demonstrably outperform solely English pretraining or limited few-shot augmentation, especially in zero-shot transfer settings (Qiu et al., 2022).
- Token Efficiency for Visual Benchmarks: Benchmark-driven innovation in token reduction enables practical scaling of LMMs to multi-image tasks (Pippi et al., 6 Mar 2025).
- Model Development Guidance: The limitations exposed—translation fidelity, replication fragility, domain adaptation—shape future directions for multilingual benchmark design and model assessment (Dandamudi et al., 2024, Plaza et al., 2024).
Ongoing research must therefore integrate automated translation, expert review, error-driven sampling, and community-driven revision to advance benchmark fidelity, coverage, and impact.