The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Published 31 Aug 2023 in cs.CL, cs.AI, and cs.LG | (2308.16884v2)

Abstract: We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

Citations (93)

Summary

  • The paper introduces a novel parallel reading comprehension dataset, Belebele, featuring 900 multiple-choice questions from FLoRes-200 passages across 122 language variants.
  • It employs a meticulous curation process to differentiate comprehension levels, enabling precise evaluation of both multilingual MLMs and LLMs.
  • The evaluation reveals that smaller, balanced multilingual models can outperform English-centric LLMs, underscoring the importance of vocabulary design in low-resource languages.

An Analytical Overview of the Belebele Benchmark

The paper presents Belebele, an extensive parallel reading comprehension dataset designed to evaluate natural language understanding (NLU) across 122 language variants. This dataset significantly enhances the ability to gauge multilingual capabilities of NLP models beyond traditional high-resource languages. By facilitating direct performance comparisons across languages, Belebele addresses a crucial gap in existing multilingual benchmarks.

Dataset Composition and Methodology

Belebele comprises 900 unique multiple-choice questions, each linked to a short passage from the FLoRes-200 dataset. Each question offers four possible answers, providing a robust framework for evaluating text comprehension across languages at varying resource levels. The dataset's design, targeting high-, medium-, and low-resource languages, provides a consistent basis for assessing multilingual masked language models (MLMs) and large language models (LLMs).
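The structure described above can be sketched as a minimal data model. This is an illustrative sketch, not the paper's code; the field names (`flores_passage`, `question`, `correct_answer_num`) are assumptions modeled on commonly distributed multiple-choice formats, and the items below are synthetic stand-ins for the real data.

```python
from dataclasses import dataclass
import random

@dataclass
class BelebeleItem:
    # Field names are illustrative assumptions, not confirmed by the paper.
    flores_passage: str
    question: str
    mc_answers: tuple          # four candidate answers
    correct_answer_num: int    # 1-indexed correct choice

def accuracy(items, predict):
    """Fraction of items where predict(item) returns the correct choice."""
    correct = sum(predict(it) == it.correct_answer_num for it in items)
    return correct / len(items)

# Synthetic items matching the dataset's size: 900 questions, 4 answers each.
rng = random.Random(0)
items = [
    BelebeleItem("passage", f"q{i}", ("a", "b", "c", "d"), rng.randint(1, 4))
    for i in range(900)
]

# With four choices, a random-guess baseline lands near 25% accuracy,
# which is the floor any comprehension result must be read against.
chance = accuracy(items, lambda it: rng.randint(1, 4))
```

An oracle predictor (`lambda it: it.correct_answer_num`) scores 1.0 by construction, which is a quick sanity check on the scoring loop.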

A significant strength of Belebele lies in its meticulous question curation process, which ensures that the questions differentiate between different levels of language comprehension without requiring extrinsic knowledge. The fully parallel nature of its linguistic content allows precise model performance evaluation across diverse languages, highlighting potential disparities and strengths.

Evaluation of Multilingual Models

Through extensive evaluations, Belebele uncovers notable insights into multilingual NLP systems. The study finds that smaller MLMs pretrained on balanced multilingual datasets often exhibit superior comprehension across languages compared to English-centric LLMs like GPT-3 and Llama. This finding underscores the vital role of balanced data in pretraining for achieving multi-language proficiency.
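One common way to evaluate a language model on a multiple-choice benchmark like this is to score each candidate answer in context and pick the highest-scoring one. The sketch below illustrates that pattern with a toy word-overlap scorer standing in for real model log-probabilities; the prompt template and `toy_score` heuristic are assumptions for illustration, not the paper's evaluation protocol.

```python
def choose_answer(score_fn, passage, question, answers):
    """Return the 1-indexed answer that score_fn rates highest.

    score_fn(prompt) -> numeric score; in a real evaluation this would be
    the model's log-probability of the completion (an assumed setup here).
    """
    prompts = [f"{passage}\nQ: {question}\nA: {a}" for a in answers]
    scores = [score_fn(p) for p in prompts]
    return max(range(len(answers)), key=scores.__getitem__) + 1

def toy_score(prompt):
    """Crude stand-in for comprehension: count answer words that also
    appear in the passage."""
    passage, rest = prompt.split("\nQ: ", 1)
    answer = rest.split("\nA: ", 1)[1]
    return len(set(answer.lower().split()) & set(passage.lower().split()))

pred = choose_answer(
    toy_score,
    "The river floods in spring",
    "When does the river flood?",
    ["in spring", "in winter", "never", "at night"],
)
```

Swapping `toy_score` for a summed token log-probability from an actual model turns this into the standard zero-shot multiple-choice setup.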

The results also substantiate the role of vocabulary design: larger vocabularies and deliberate vocabulary construction correlate with better performance, particularly in low-resource languages. This insight into vocabulary dynamics could inform pretraining strategies and guide more efficient multilingual model development.
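A toy illustration of why vocabulary size matters: a subword vocabulary that barely covers a language fragments its words into many pieces (high "fertility"), while a richer vocabulary keeps words intact. This greedy longest-match segmenter is a simplification of BPE-style tokenization, and the Swahili words and vocabularies are illustrative assumptions, not drawn from the paper.

```python
def tokenize(word, vocab):
    """Greedy longest-match subword segmentation (simplified BPE-style);
    falls back to single characters when no vocabulary piece matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

def fertility(words, vocab):
    """Average subword tokens per word; lower means the vocabulary
    covers the language better."""
    return sum(len(tokenize(w, vocab)) for w in words) / len(words)

words = ["habari", "asante", "karibu"]          # illustrative Swahili words
small_vocab = {"ha", "ri"}                       # vocabulary with poor coverage
large_vocab = {"habari", "asante", "karibu"}     # vocabulary with whole-word entries

frag = fertility(words, small_vocab)   # words shatter into many pieces
whole = fertility(words, large_vocab)  # each word is a single token
```

Higher fertility means low-resource text consumes more of the model's context window and gets less coherent representations per word, which is one mechanism behind the correlation the paper reports.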

Implications and Future Directions

Belebele opens new avenues for analyzing how LLMs handle language diversity and comprehension tasks. By providing a diverse linguistic benchmark, the dataset encourages further exploration into cross-lingual transfer, script variations, and the effectiveness of different pretraining regimes.

The research emphasizes the importance of extending LLM capabilities into less-studied languages, advocating for equitable NLP systems. The findings present foundational elements for advancing AI technologies that are inclusive of linguistic diversity.

In conclusion, Belebele represents a significant step in enhancing the evaluation breadth of NLP systems, offering crucial insights for both practical model development and theoretical exploration. Future investigations might focus on improving pretraining strategies to boost performance in low-resource languages, potentially leading to a new generation of more inclusive LLMs.
