MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Published 8 Oct 2024 in cs.CL and cs.AI | (2410.05873v2)

Abstract: English-centric LLMs often show strong multilingual capabilities. However, their multilingual performance remains unclear and is under-evaluated for many other languages. Most benchmarks for multilinguality focus on classic NLP tasks or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages that English-centric LLMs use English as a pivot language in their intermediate layers. MEXA computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in different languages. We conduct controlled experiments using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves an average Pearson correlation of 0.90 between its predicted scores and actual task performance across languages. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://cis-lmu-mexa.hf.space, Code: https://github.com/cisnlp/MEXA.

Summary

The paper presents MEXA, an evaluation method that uses cross-lingual alignment over parallel sentences to assess the multilingual potential of English-centric LLMs; its scores reach an average Pearson correlation of 0.90 with downstream task performance.
It employs weighted average sentence embeddings with mean pooling to consistently yield precise alignment scores across diverse languages and model layers.
Experimental results using datasets like FLORES-200 and the Bible validate the approach, highlighting superior multilingual performance in models such as Gemma 2 and Llama 3.1-70B.

Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

The paper presents a novel method for evaluating the multilingual capabilities of English-centric LLMs. This approach addresses the gap in comprehensive multilingual performance assessments by leveraging cross-lingual alignment through parallel sentences.

Overview of the Method

The authors introduce a method called MEXA, which measures the alignment between English and other languages inside an LLM by comparing the model's embeddings of parallel sentences. Because English-centric LLMs use English as a pivot language in their intermediate layers, this alignment serves as a proxy for how well language understanding transfers from English to another language, and it can be used to estimate the model's performance in that language.
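The summary does not spell out how the alignment score is computed, so the following is only a minimal sketch under a stated assumption: a parallel pair counts as aligned when its cosine similarity is strictly higher than that of every non-parallel pairing involving either sentence. The function name `alignment_score` and the toy embeddings are hypothetical and are not taken from the MEXA code.

```python
import numpy as np

def alignment_score(en_emb, xx_emb):
    """Fraction of parallel pairs that are each other's closest match.

    en_emb, xx_emb: arrays of shape (num_sentences, dim); row i of each
    array embeds the same sentence in English and in the other language.
    ASSUMPTION: pair i is "aligned" if its cosine similarity is strictly
    greater than every other entry in row i and column i.
    """
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    xx = xx_emb / np.linalg.norm(xx_emb, axis=1, keepdims=True)
    sim = en @ xx.T                      # cosine similarity matrix
    diag = np.diag(sim).copy()           # similarities of the parallel pairs
    off = sim.copy()
    np.fill_diagonal(off, -np.inf)       # hide parallel pairs from the maxima
    aligned = (diag > off.max(axis=1)) & (diag > off.max(axis=0))
    return float(aligned.mean())

# Toy embeddings: four orthogonal "sentences".
en = np.eye(4)
perfect = alignment_score(en, np.eye(4))                 # every pair matches
swapped = alignment_score(en, np.eye(4)[[1, 0, 2, 3]])   # first two pairs swapped
print(perfect, swapped)
```

In practice the embeddings would come from one layer of the LLM, and the score would be computed once per layer and per language.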

Experimental Setup

The study uses two parallel datasets, FLORES-200 and the Bible, along with a selection of LLMs: the Llama and Gemma families, Mistral, and OLMo. The established downstream tasks Belebele, m-MMLU, and m-ARC provide the reference performance against which the alignment scores are compared.

Results and Findings

In its default settings, MEXA achieves an average Pearson correlation of 0.90 between its scores and actual performance on the downstream tasks across languages. This suggests that MEXA scores are reliable indicators of multilingual potential. The analysis also reveals differences between models, with Gemma 2 and Llama 3.1-70B showing stronger multilingual abilities than models such as OLMo.
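To make the reported statistic concrete, here is a small sketch of how a Pearson correlation between per-language alignment scores and per-language task accuracy can be computed. The numbers below are invented purely for illustration and are not results from the paper; the helper name `pearson` is likewise hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()                    # center both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# HYPOTHETICAL values, one entry per language (not from the paper).
alignment_scores = [0.95, 0.80, 0.55, 0.30, 0.10]
task_accuracy = [0.88, 0.79, 0.60, 0.41, 0.30]

r = pearson(alignment_scores, task_accuracy)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means that languages with higher alignment scores also tend to have higher task accuracy, which is the relationship the paper relies on.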

Analysis of Sentence Embeddings

The paper explores two main strategies for computing sentence embeddings in decoder-only models: a weighted average of the token embeddings and the embedding of the last token. Weighted-average embeddings combined with mean pooling consistently yield the most precise alignment scores across languages and model layers.
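The two embedding strategies can be sketched as follows. The summary does not define the weights, so this sketch assumes position-based weights (token i of T gets weight i divided by the sum 1 + ... + T), a common choice for causal models in which later tokens have attended to more context. It also assumes that "mean pooling" means averaging the per-layer alignment scores. All function names here are hypothetical.

```python
import numpy as np

def weighted_average_embedding(hidden):
    """hidden: (num_tokens, dim) hidden states of one sentence at one layer.

    ASSUMPTION: weights grow linearly with token position and sum to 1.
    """
    positions = np.arange(1, hidden.shape[0] + 1, dtype=float)
    weights = positions / positions.sum()
    return weights @ hidden

def last_token_embedding(hidden):
    """Use the final token's hidden state as the sentence embedding."""
    return hidden[-1]

def mean_pool_layers(layer_scores):
    """ASSUMPTION: one alignment score per layer, averaged into one value."""
    return float(np.mean(layer_scores))

# Toy sentence with three tokens and two hidden dimensions.
hidden = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0]])
weighted = weighted_average_embedding(hidden)   # weights 1/6, 2/6, 3/6
last = last_token_embedding(hidden)
pooled = mean_pool_layers([0.2, 0.6, 0.7])
print(weighted, last, pooled)
```

The weighted average draws on every token while still favoring later ones, whereas the last-token embedding depends on a single position.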

Implications and Future Directions

This method provides substantial insight into the cross-lingual alignment capabilities within LLMs, shedding light on their structural multilingualism through a detailed examination of sentence embeddings. The insights gained from this study are crucial for further model development and enhancement of multilingual performance, paving the way for more equitable language understanding across underrepresented languages.

Future research could expand on these findings by exploring additional language script combinations and probing deeper into the inner workings of LLM layers. The approach also highlights the need for developing more extensive and diverse multilingual benchmarks, potentially incorporating cultural and language-specific nuances to further refine the understanding of LLM capabilities.

Overall, this research contributes to multilingual NLP a framework for estimating the multilingual potential of English-centric LLMs from parallel sentences alone, which are available for more languages than existing downstream tasks.
