
FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models

Published 20 Apr 2025 in cs.CL and cs.AI | (2504.14690v1)

Abstract: Research on evaluating and analyzing LLMs has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating LLMs in Persian. This benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. This benchmark incorporates linguistic, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current LLMs are still far from being able to solve this benchmark.

Summary

  • The paper introduces FarsEval-PKBETS, a new benchmark specifically designed to evaluate large language models (LLMs) in the Persian language, addressing the need for evaluation beyond high-resource languages.
  • FarsEval-PKBETS features 4,000 diverse questions across multiple domains and formats, uniquely incorporating Persian cultural and linguistic nuances with input from domain experts.
  • Evaluations using the benchmark show existing LLMs like Llama3-70B, PersianMind, and Dorna achieve below 50% accuracy on average, highlighting their current limitations in handling Persian cultural and linguistic content.

FarsEval-PKBETS: A Benchmark for Persian LLMs

The paper introduces FarsEval-PKBETS, a structured benchmark designed to evaluate the performance of LLMs on the Persian language. The development of this benchmark responds to the relative neglect of LLM evaluation in low-resource languages like Persian, in contrast with the extensive body of research focused on high-resource languages, particularly English.

Composition and Methodology

FarsEval-PKBETS comprises 4,000 questions in various response formats, including multiple-choice, short-answer, and descriptive prompts. The benchmark covers an extensive range of domains: medicine, law, religion, the Persian language, social and encyclopedic knowledge, ethics, and text generation tasks. A significant highlight is its incorporation of linguistic, cultural, and local nuances pertinent to Persian and Iranian contexts, often absent in non-English benchmarks. The dataset also emphasizes challenging question designs to push the limits of current LLM competencies.
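To make the benchmark's composition concrete, the sketch below shows one way such items could be represented in code. The field names and example values are illustrative assumptions, not the paper's actual data schema:

```python
from dataclasses import dataclass

# Hypothetical representation of a FarsEval-PKBETS-style item.
# Field names and domain labels are illustrative, not the paper's schema.
@dataclass
class BenchmarkItem:
    question: str   # Persian-language prompt
    answer: str     # reference answer
    fmt: str        # "multiple_choice", "short_answer", or "descriptive"
    domain: str     # e.g. "medicine", "law", "religion", "text_generation"

items = [
    BenchmarkItem("...", "...", "multiple_choice", "medicine"),
    BenchmarkItem("...", "...", "descriptive", "text_generation"),
]

# Group items by domain, as a per-category evaluation would require.
by_domain = {}
for item in items:
    by_domain.setdefault(item.domain, []).append(item)

print(sorted(by_domain))  # → ['medicine', 'text_generation']
```

A structure like this keeps the three response formats and the many domains separable, so accuracy can later be reported per category rather than only in aggregate.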

The benchmark methodology included contributions from experts across different fields, ensuring the datasets are both challenging and accurately reflect domain-specific knowledge. This diverse input is key to generating authentic and culturally resonant questions.

Evaluation and Results

The study evaluates three distinct models, Llama3-70B, PersianMind, and Dorna, using the FarsEval-PKBETS benchmark. The average accuracy of these models falls below 50%, underscoring the challenges these LLMs face when tested against this Persian-specific benchmark. For example, accuracy varies significantly across categories, with performance in general medicine and lexical semantics notably higher than in Persian language tasks and text generation. These results spotlight the models' difficulty in comprehending culturally and linguistically specific content, indicating substantial room for improvement in how LLMs handle non-English languages.
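The reported numbers follow the usual accuracy computation: a question counts only if the model's answer is fully correct, and results are aggregated both per category and overall. A minimal sketch, with made-up records standing in for real model outputs:

```python
from collections import defaultdict

# Illustrative records: (category, was_fully_correct). The paper scores an
# answer as correct only when it is fully correct; these values are made up.
results = [
    ("general medicine", True), ("general medicine", True), ("general medicine", False),
    ("text generation", False), ("text generation", True), ("text generation", False),
]

def per_category_accuracy(records):
    """Fraction of fully correct answers within each category."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in records:
        totals[category] += 1
        correct[category] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

acc = per_category_accuracy(results)
overall = sum(ok for _, ok in results) / len(results)
print(acc)      # per-category accuracy
print(overall)  # → 0.5, i.e. below the 50% threshold discussed above
```

Breaking accuracy out per category is what exposes the gap the paper highlights: strong domains (such as general medicine) can mask weak ones (such as text generation) in a single averaged score.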

Significance and Future Directions

This research has both theoretical and practical significance. Theoretically, it stresses the need for LLMs to better integrate cultural and linguistic nuances, particularly those outside high-resource languages. Practically, it highlights the inadequacy of current evaluations and models and suggests a pathway for LLM enhancement. The benchmark may guide future improvements in LLMs targeting Persian, and could extend as a framework for other underrepresented languages.

Further investigations might expand FarsEval-PKBETS to include a wider array of linguistic features or incorporate real-world applications tailored to native speakers' cultural contexts. Additionally, the persistent advancement in AI models mandates the continual development of benchmarks that adapt to and anticipate future capabilities of evolving LLMs.
