- The paper introduces the UNED-ACCESS 2024 dataset of 1003 bilingual multiple-choice questions for zero-shot evaluation of LLMs.
- The paper evaluates both open-source and proprietary models using Cohen’s Kappa, revealing notable language gaps in smaller models.
- The paper correlates results with MMLU to validate the dataset’s reliability and underscores the need for enhanced multilingual LLM optimization.
An Examination of Bilingual Evaluation of LLMs on General Knowledge in University Entrance Exams
The paper "Bilingual Evaluation of LLMs on General Knowledge in University Entrance Exams with Minimal Contamination" introduces UNED-ACCESS 2024, a bilingual dataset comprising 1003 multiple-choice questions derived from university entrance exams in Spanish and English. The dataset was meticulously compiled and translated to ensure minimal contamination, presenting a unique opportunity to evaluate the performance of various LLMs across both languages. This study evaluates several current open-source and proprietary models under a uniform zero-shot experimental setting and correlates these results with those obtained from an equivalent subset of the MMLU dataset.
Key Contributions
- Introduction of UNED-ACCESS 2024: The dataset includes 1003 multiple-choice questions from various domains such as Business Administration, Biology, Mathematics, Literature, and Psychology. These questions were originally formulated in Spanish and manually translated into English by professional translators to ensure high fidelity and minimal contamination.
- Comprehensive Evaluation: A diverse set of LLMs, including proprietary models like GPT-4-Turbo and Claude-3-Opus, and open-source models such as Llama-2-7B and Gemma-2-27B, were evaluated on the dataset. The testing was executed in a zero-shot setting to ensure uniformity and to simulate real-world use-cases where examples are not provided.
- Language Performance Gap Analysis: The study analyzes the performance gap between Spanish and English, finding that it varies inversely with model quality: the most advanced models show negligible gaps, while smaller models suffer a significant drop in performance in Spanish.
- Correlation with MMLU: By comparing the results obtained from UNED-ACCESS 2024 with a subset of the MMLU dataset, the paper validates the applicability and reliability of the new dataset. The high correlation between the results on both datasets suggests that the UNED-ACCESS 2024 dataset is representative and diverse enough despite its smaller size.
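The zero-shot protocol described above can be sketched as a simple evaluation loop. This is a minimal illustration, not the paper's actual harness: `query_model`, the prompt template, and the answer-parsing rule are all hypothetical stand-ins.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop.
# `query_model` is a hypothetical stand-in for any LLM API call
# that takes a prompt string and returns the model's text answer.

def build_prompt(question: str, options: list[str]) -> str:
    """Format one question with lettered options; no in-context
    examples are included (zero-shot)."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def evaluate(items, query_model) -> float:
    """Return raw accuracy over (question, options, gold_letter) triples."""
    correct = 0
    for question, options, gold in items:
        prediction = query_model(build_prompt(question, options))
        if prediction.strip().upper().startswith(gold):
            correct += 1
    return correct / len(items)
```

Running the same loop once with the Spanish questions and once with their English translations yields the two per-language scores that the gap analysis compares.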
Evaluation Metrics and Results
The models were evaluated using Cohen's Kappa, a robust measure that corrects for chance agreement in multiple-choice settings. High correlation was observed between performance on UNED-ACCESS and MMLU, reinforcing the dataset's validity.
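The two measurements above can be sketched in a few lines. The kappa function below assumes uniform random guessing over the answer options as the chance baseline; the paper's exact estimator may differ.

```python
from math import sqrt

def chance_corrected_kappa(accuracy: float, num_options: int) -> float:
    """Cohen's-Kappa-style correction: rescale accuracy so that
    uniform random guessing (1/num_options) maps to 0 and perfect
    accuracy maps to 1."""
    p_chance = 1.0 / num_options
    return (accuracy - p_chance) / (1.0 - p_chance)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two score vectors, e.g. per-model
    results on UNED-ACCESS vs. the MMLU subset."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With four options, an accuracy of 0.25 maps to a kappa of 0, so a model that only guesses gets no credit.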
- Proprietary Models: Models like Claude-3 and GPT-4-Turbo led in performance with minimal differences between their Spanish and English results. Claude-3 achieved a Cohen’s Kappa of 0.81 in Spanish and 0.79 in English, a negligible gap that slightly favors Spanish.
- Open-Source Models: Larger models like Llama-3-70B and Gemma-2-27B showed strong performance with negligible gaps, whereas smaller models like Leniachat-Gemma-2B demonstrated larger discrepancies, with up to a 37.27% performance drop in Spanish.
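The gap figures quoted above can be expressed as a relative drop from English to Spanish performance. The scores in the comment are illustrative placeholders, not the paper's actual numbers; they merely show how a drop of that magnitude arises.

```python
def relative_gap(score_en: float, score_es: float) -> float:
    """Relative performance drop (in %) from English to Spanish.
    Positive values mean the model scores lower in Spanish."""
    return (score_en - score_es) / score_en * 100.0

# Illustrative only (not the paper's reported scores):
# relative_gap(0.55, 0.345) gives a drop of roughly 37%, the
# magnitude reported for the weakest models.
```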
Implications and Future Directions
The results indicate that larger, more advanced models are becoming increasingly effective in bilingual contexts, narrowing the performance gap between languages. However, smaller models still exhibit significant performance degradation in Spanish, highlighting the need for more training data and better optimization techniques for lesser-resourced languages.
The robustness of the UNED-ACCESS 2024 dataset, despite its smaller size, suggests that future research could benefit from similar high-quality, low-contamination datasets. More expansive datasets covering higher-order cognitive tasks and specialized subjects could further enhance the evaluation of LLMs.
Conclusion
The introduction of the UNED-ACCESS 2024 dataset provides valuable insights into the bilingual capabilities of current state-of-the-art LLMs. The study demonstrates that with minimal contamination and high-quality translations, such datasets can serve as reliable benchmarks. The findings encourage further development and optimization of LLMs to handle multiple languages effectively, particularly for smaller and specialized models. Future work should focus on expanding the dataset to include more challenging questions and additional subjects, thus providing a broader evaluation spectrum for emerging LLMs.