NeoBabel: A Multilingual Open Tower for Visual Generation

Published 8 Jul 2025 in cs.CL, cs.AI, and cs.CV | (arXiv:2507.06137v1)

Abstract: Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.

Summary

  • The paper introduces a scalable multilingual framework that integrates a unified multimodal transformer for effective cross-language image generation.
  • The methodology employs progressive pretraining stages—including pixel dependency learning, large-scale multilingual data alignment, and refined instruction tuning—to boost performance.
  • Evaluation shows NeoBabel’s superior performance on benchmarks like m-GenEval, offering efficient, culturally inclusive, and visually consistent outputs across languages.

Introduction to NeoBabel

NeoBabel proposes an innovative framework for multilingual image generation, addressing the prevalent bias toward English-centric systems. By introducing a scalable and inclusive model capable of understanding and generating images in multiple languages, the authors aim to mitigate the digital divide and enhance access to generative AI tools globally (Figure 1).

Figure 1: NeoBabel establishes a new Pareto frontier in multilingual image generation performance, efficiency, and inclusivity.

Model Architecture

Core Components

NeoBabel's architecture centers on a multilingual transformer backbone with a unified multimodal embedding space. The model uses the pretrained tokenizer of a multilingual LLM, Gemma-2, for textual inputs and a purpose-trained MAGVIT-v2 quantizer for images, ensuring fine-grained visual encoding.
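As an illustrative sketch of how such a unified embedding space can be realized (not the paper's actual implementation; the function name and vocabulary size are hypothetical), text-token ids and image codebook indices can share one vocabulary by offsetting image ids past the text vocabulary:

```python
def build_multimodal_sequence(text_ids, image_ids, text_vocab_size):
    """Map text-token ids and image codebook indices into one shared
    vocabulary by offsetting image ids past the text vocabulary.
    Special/control tokens are omitted for brevity."""
    return list(text_ids) + [text_vocab_size + i for i in image_ids]

# Hypothetical ids; 256000 is a placeholder text-vocabulary size.
seq = build_multimodal_sequence([5, 17, 3], [0, 42], text_vocab_size=256000)
# seq == [5, 17, 3, 256000, 256042]
```

With text and image tokens living in one id space, a single transformer can model both modalities with one output head.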

Transformer Backbone

By extending the LLM's capabilities, NeoBabel processes text and images in a shared space, enabling rich cross-modal interaction. This is accomplished through modality-aware attention patterns: causal attention over text tokens and bidirectional attention over image tokens, supporting complex and coherent generation tasks.
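A minimal sketch of such a modality-aware mask, assuming text tokens precede image tokens in the sequence (the layout and helper are illustrative, not the paper's code):

```python
import numpy as np

def modality_aware_mask(n_text, n_image):
    """Boolean attention mask (True = query may attend to key):
    text queries use causal attention over the text prefix, while
    image queries attend bidirectionally to the whole sequence."""
    n = n_text + n_image
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n_text):          # causal over the text prefix
        mask[i, : i + 1] = True
    mask[n_text:, :] = True          # bidirectional for image tokens
    return mask

m = modality_aware_mask(n_text=3, n_image=2)
# text token 0 cannot see token 2, but image tokens see everything
```

In practice this mask would be added (as `-inf` at `False` positions) to the attention logits before the softmax.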

Training Methodology

Progressive Pretraining

The training process is divided into three stages: pixel dependency learning, scaling alignment with large-scale multilingual data, and refined multilingual pretraining. This approach incrementally enhances the model's ability to generate high-quality, cross-linguistically coherent images.
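The staged curriculum above can be sketched as a simple ordered schedule; the stage names follow the text, but the fields and driver function are hypothetical placeholders, not the paper's configuration:

```python
# Hypothetical schedule mirroring the three pretraining stages; the
# data descriptions are illustrative, not the actual training mix.
STAGES = [
    {"name": "pixel_dependency_learning", "data": "image tokens only"},
    {"name": "multilingual_alignment", "data": "large-scale multilingual pairs"},
    {"name": "refined_multilingual_pretraining", "data": "curated multilingual pairs"},
]

def run_pretraining(stages, train_stage):
    """Run each stage in order; `train_stage` is a stand-in callback
    for one full training phase."""
    return [train_stage(s["name"], s["data"]) for s in stages]

log = run_pretraining(STAGES, lambda name, data: f"{name} <- {data}")
```

The point of the ordering is that each stage inherits the weights of the previous one, so later multilingual stages refine rather than relearn low-level visual structure.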

Instruction Tuning

By employing high-resolution datasets, the instruction tuning phase refines NeoBabel’s understanding and execution of multilingual visual tasks. Adjustments in dataset mixing ratios during tuning improve performance across languages.
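Dataset mixing of this kind is typically implemented as weighted sampling over language-specific corpora. The sketch below uses placeholder ratios over the six supported languages; the paper's actual tuning ratios are not reproduced here:

```python
import random

# Placeholder mixing ratios; illustrative only.
MIX = {"en": 0.4, "zh": 0.12, "nl": 0.12, "fr": 0.12, "hi": 0.12, "fa": 0.12}

def sample_language(mix, rng):
    """Draw one training language according to the mixing ratios."""
    langs, weights = zip(*mix.items())
    return rng.choices(langs, weights=weights, k=1)[0]

lang = sample_language(MIX, random.Random(0))  # a language code from MIX
```

Adjusting the weights between tuning phases changes how often each language's examples are seen without rebuilding the underlying datasets.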

Evaluation and Results

Quantitative Analysis

NeoBabel surpasses state-of-the-art models on multilingual benchmarks such as m-GenEval and m-DPG while being 2-4x smaller, achieving a robust balance of high performance and efficiency (Figure 2).

Figure 2: m-GenEval benchmark comparison showcasing NeoBabel's strong cross-lingual performance.

Qualitative Analysis

NeoBabel produces semantically and visually consistent outputs across languages, faithfully rendering diverse cultural and linguistic inputs. The model also supports complex generative tasks such as image inpainting and extrapolation from multilingual prompts (Figure 3).

Figure 3: Qualitative evaluation of NeoBabel showing language-agnostic visual coherence.

Advanced Capabilities

NeoBabel extends its utility through innovative features like multilingual image inpainting and cross-lingual prompt generation, enabling diverse and inclusive user engagement in AI-powered creativity (Figure 4).

Figure 4: Multilingual image inpainting capability of NeoBabel.
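A minimal sketch of how inpainting works in token space, assuming a masked-token generation scheme in the spirit of MAGVIT-v2 (the helper and `mask_id` are hypothetical): tokens inside the edit region are replaced by a mask id and then re-sampled by the model conditioned on the new multilingual prompt.

```python
def mask_region(image_token_ids, region, mask_id):
    """Replace the token ids inside `region` with `mask_id` so the
    model can re-generate only that region, conditioned on a prompt."""
    out = list(image_token_ids)
    for i in region:
        out[i] = mask_id
    return out

masked = mask_region([7, 8, 9, 10], region=[1, 2], mask_id=-1)
# masked == [7, -1, -1, 10]; only positions 1 and 2 get re-generated
```

Because the prompt and image share one sequence, the same mechanism works regardless of which of the six languages the edit instruction is written in.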

Conclusion

NeoBabel represents a significant step toward democratizing generative AI, removing language barriers and ensuring equitable access to technology. By providing an open-source toolkit, it encourages further research and development in multilingual AI systems, fostering a future where generative models can accurately and fairly serve a global audience while preserving cultural diversity. Future directions involve scaling the model to support additional languages and exploring integration with more diverse datasets to further enhance cultural understanding and representation in AI-generated content.
