MEGA: Multilingual Evaluation of Generative AI

Published 22 Mar 2023 in cs.CL | (2303.12528v4)

Abstract: Generative AI models have shown impressive performance on many Natural Language Processing tasks such as language understanding, reasoning, and language generation. An important question being asked by the AI community today is about the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies on generative LLMs have been restricted to English and it is unclear how capable these models are at understanding and generating text in other languages. We present the first comprehensive benchmarking of generative LLMs - MEGA, which evaluates models on standard NLP benchmarks, covering 16 NLP datasets across 70 typologically diverse languages. We compare the performance of generative LLMs including Chat-GPT and GPT-4 to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform compared to the previous generation of LLMs. We present a thorough analysis of the performance of models across languages and tasks and discuss challenges in improving the performance of generative LLMs on low-resource languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.

Citations (216)

Summary

  • The paper introduces MEGA, a benchmarking framework assessing multilingual LLM performance across 70 languages using varied NLP tasks.
  • The analysis finds that generative models perform better in Latin-script languages due to richer pre-training data, while low-resource languages face challenges.
  • The study shows that enhanced tokenization and adaptive prompt strategies, like translate-test, can significantly boost outputs for underrepresented languages.

The paper "MEGA: Multilingual Evaluation of Generative AI" addresses the challenge of evaluating LLMs like ChatGPT and GPT-4 in a multilingual context. The study starts from the observation that, while generative AI has demonstrated remarkable capabilities on many NLP tasks, its performance is evaluated predominantly in English, leaving its efficacy in other languages largely unexplored.

The authors introduce MEGA, a comprehensive benchmarking framework for evaluating generative LLMs across a diverse linguistic spectrum: 16 NLP datasets covering 70 typologically diverse languages. This extensive evaluation compares generative models against state-of-the-art (SOTA) non-autoregressive models. A key goal of the framework is to ascertain the comparative performance of these models not only across languages but also across varied NLP tasks, such as classification, question answering, sequence labeling, and natural language generation.

Key Findings

  1. Performance Disparity: MEGA illustrates a noticeable gap in LLM performance between English and non-English languages, exacerbated further for low-resource languages and those written in non-Latin scripts. Generative models, particularly state-of-the-art ones, perform significantly better on Latin-script languages, revealing a bias toward the languages that dominate their training data.
  2. Impact of Pre-Training Data: The multilingual capabilities of these models correlate with the pre-training data's linguistic diversity. While models like GPT-4 mitigate the performance gap to a certain extent, non-English languages, especially those underrepresented in pre-training corpora, see a marked drop in performance.
  3. Tokenization and Prompt Strategy: The study explores the implications of tokenizer quality and prompt strategies in multilingual contexts. Poor tokenization quality of under-represented languages results in inflated token counts, which hampers model performance. Prompt adaptations such as 'translate-test' strategies offer significant improvements for low-resource languages.
  4. Evaluation Strategies: The analysis covers multiple prompting strategies, including monolingual, zero-shot cross-lingual, and translate-test prompts, revealing varying levels of effectiveness across tasks and languages.
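The tokenization effect in point 3 can be illustrated with a rough, self-contained sketch. It uses UTF-8 byte counts per character as a proxy for how much longer a byte-level tokenizer's input sequence starts out for non-Latin scripts; the paper measures actual tokenizer fertility on real models, so this is only an illustrative approximation (the sample sentences are invented):

```python
# UTF-8 byte counts as a rough proxy for byte-level tokenization inflation:
# scripts outside Basic Latin need 2-3 bytes per character (Devanagari and
# Tamil characters take 3 bytes each), so before any merges a byte-level
# tokenizer starts from a far longer sequence than for English.
samples = {
    "English": "The weather is nice today.",
    "Hindi": "आज मौसम अच्छा है।",
    "Tamil": "இன்று வானிலை நன்றாக உள்ளது.",
}

def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)

for lang, text in samples.items():
    print(f"{lang}: {bytes_per_char(text):.2f} bytes/char")
```

English lands at roughly 1 byte per character while the Devanagari and Tamil samples land well above 2, which is one ingredient in the inflated token counts the paper reports for under-represented languages.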

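The three prompting setups in point 4 can be sketched as prompt-assembly functions. This is a minimal illustration, not MEGA's actual code: the function names, prompt template, and the `translate` callable are all assumptions standing in for a real few-shot template and machine-translation system.

```python
def build_prompt(instruction: str, exemplars: list[tuple[str, str]], test_input: str) -> str:
    """Assemble a few-shot prompt: instruction, then input/label pairs, then the test input."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in exemplars)
    return f"{instruction}\n{shots}\nInput: {test_input}\nLabel:"

def monolingual_prompt(instruction, target_lang_exemplars, test_input):
    # Monolingual: few-shot exemplars are drawn from the target language itself.
    return build_prompt(instruction, target_lang_exemplars, test_input)

def zeroshot_crosslingual_prompt(instruction, english_exemplars, test_input):
    # Zero-shot cross-lingual: exemplars are in a pivot language (English),
    # but the test input stays in the target language.
    return build_prompt(instruction, english_exemplars, test_input)

def translate_test_prompt(instruction, english_exemplars, test_input, translate):
    # Translate-test: the target-language input is machine-translated into
    # English first, so the model operates entirely in English.
    return build_prompt(instruction, english_exemplars, translate(test_input))
```

In the translate-test variant, overall quality becomes bounded by the MT system, which is the trade-off behind its gains on low-resource languages.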
Implications and Future Directions

The paper implies that advancing multilingual generative AI requires a multifaceted approach focusing on enhancing pre-training data diversity, improving tokenization strategies, and developing refined multilingual prompt strategies. Future research directions include the assessment of low-resource languages' specific needs in NLP applications, leading to more equitable AI systems.

A significant practical implication of this work lies in pushing the AI research agenda toward inclusive technology that serves linguistically diverse users well. The MEGA framework not only establishes a benchmark but also opens up questions about better methods for evaluating multilingual generative models, particularly in languages that lack extensive resources.

In conclusion, the MEGA benchmark sets a precedent for comprehensive multilingual evaluations, underlining both the current achievements and the limitations of generative AI across languages. It also encourages further research toward models that perform equitably across the linguistic spectrum, closing capability gaps for lesser-represented languages.
