Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Abstract: Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in LLMs but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection because it is simple to implement, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show that our 14B model outperforms baseline models with more than five times as many parameters and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
Summary
- The paper introduces Bi'an, a novel bilingual benchmark and an associated model specifically designed for evaluating and improving hallucination detection in Retrieval-Augmented Generation (RAG).
- Bi'anBench provides evaluation tasks covering diverse formats, including Question Answering, Summarization, Data-to-Text, and Machine Translation in both English and Chinese.
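- Experimental results on Bi'anBench show that the 14B Bi'an judge model outperforms baselines with more than five times as many parameters, while strong general-purpose models such as Qwen2.5-72B-Instruct also detect RAG hallucinations well (83.3 average on Bi'anBench_EN, against 84.8 for GPT-4o-0806).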
'{"REASONING": ["The DOCUMENT indicates that the Arrêt de Mérindol was issued on January 1, 1545.", "The DOCUMENT also mentions that several villages were devastated, but it does not specify the exact timing of the devastation in relation to the Arrêt de Mérindol.", "The QUESTION asks what happened second between the Arrêt de Mérindol and the devastation of villages.", "The ANSWER provided is the devastation of villages.", "Since the DOCUMENT does not provide the exact timing of the devastation of villages, it cannot be concluded that it happened after the Arrêt de Mérindol."], "SCORE": "FAIL"}', 'rejected': '{"REASONING": ["The DOCUMENT states that the French king, Francis I, issued the "Arrêt de Mérindol" on January 1, 1545, and armed a crusade against the Waldensians of Provence.", "The DOCUMENT also mentions that deaths ranged from hundreds to thousands, and several villages were devastated.", "The QUESTION asks what happened second: Arrêt de Mérindol or devastation of villages.", "The ANSWER provided is Arrêt de Mérindol.", "Based on the DOCUMENT, the Arrêt de Mérindol occurred before the devastation of villages."], "SCORE": "PASS"}'} \\
& \multirow{15}{*}{ZH} & {'input': 'Given a "Document" and a "Summary", you must analyze the provided "Summary" and determine whether it is faithful to the content of the "Document".\textbackslash{}n The "Summary" must not provide new information beyond the context given in the "Document". The "Summary" must also not contradict the information provided in the "Document". Output your final judgment strictly in the following format: "通过" (pass) if the "Summary" is faithful to the "Document"; "失败" (fail) if the Summary is not faithful to the Document.\textbackslash{}n --\textbackslash{}n Document: Agent: May I ask what problem you have run into that you need my help to handle or resolve?\textbackslash{}nUser: I would like to ask when my invoice will be mailed\textbackslash{}nAgent: The electronic invoice for this order has already been issued\textbackslash{}nAgent: On PC: My JD, Customer Service, My Invoices, then download from Invoice Details; in the app: Me, Customer Service, Invoice Service, then view Invoice Details\textbackslash{}nUser: I need a paper special invoice\textbackslash{}nUser: Send me my orders again and I will pick one for you\textbackslash{}nAgent: [order number] Is it this order?\textbackslash{}nUser: No\textbackslash{}nUser: [order number]\textbackslash{}nUser: It is this one\textbackslash{}nAgent: [order number] is the waybill number of the invoice; it is already out for delivery\textbackslash{}nUser: Can I check where it is now?\textbackslash{}nAgent: [site] Your order has finished dispatch at JD [site] and is ready to be sent to JD [site]\textbackslash{}nUser: Mm, okay\textbackslash{}nUser: Can it arrive tomorrow?\textbackslash{}nAgent: I cannot guarantee that either\textbackslash{}nUser: Mm, okay, [name]\textbackslash{}nAgent: You are welcome, dear; it is what I should do\textbackslash{}nAgent: Dear, is there anything else you need help with?\textbackslash{}nAgent: If not, please remember to leave a rating for [name]; a quick tap of your lucky hand will do\textbackslash{}n \textbackslash{}n--\textbackslash{}n Summary: The user asks when their invoice can be mailed and whether it can arrive tomorrow. The agent replies that the invoice shipment has finished dispatch at JD [site] and is ready to be sent to JD [site], but cannot guarantee to the user that it will arrive tomorrow.\textbackslash{}n --\textbackslash{}nYour output should be in JSON format, with the keys "推理过程" (reasoning process) and "判断" (judgment): {"推理过程":<the step-by-step points of how you reasoned to reach the judgment>, "判断":<your final judgment>}', 'chosen': '{"推理过程": ["The user asks when the invoice will be mailed; in the Document the agent replies that the electronic invoice has already been issued and explains how to download it.", "Whether the agent guarantees that the invoice will arrive tomorrow: in the Document the agent says this cannot be guaranteed.", "The Summary omits the important information that the user needs a paper special invoice and that the agent provides the order number.", "This shows that the Summary is not faithful enough."], "判断": "失败"}', 'rejected': '{"推理过程": ["The user asks when the invoice will be mailed; in the Document the agent replies that the electronic invoice has already been issued and explains how to download it.", "The user asks whether it can arrive tomorrow; in the Document the agent says this cannot be guaranteed.", "The content mentioned in the Summary is basically consistent with the dialogue in the Document, but is not sufficient.", "The Summary omits the important information that the user needs a paper special invoice and that the agent provides the order number.", "This shows that the Summary is not faithful enough."], "判断": "失败"}'} \\
\bottomrule
\end{tabular}
}
\caption{Examples of training dataset.}
\label{tab:train_example}
\end{table}
\end{CJK*}
\subsection{Two-Stage Training Process}
During the SFT stage, we use a learning rate of 1e-5 and a batch size of 4, and train for 3 epochs. In the DPO stage, we set beta to 0.1, the learning rate to 5e-7, and the batch size to 4, and again train for 3 epochs. For the LoRA configuration, we set r=16, LoRA\_alpha=32, and LoRA\_dropout=0.05, and fine-tune only the Q, K, V, and O projection matrices.
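The hyperparameters above can be collected in a small, dependency-free sketch. The class and function names, and the module names `q_proj`, `k_proj`, `v_proj`, `o_proj`, are assumptions made for illustration (they follow a common naming convention and are not taken from the paper). The loss function shows the standard DPO objective for a single chosen/rejected pair, which may differ in detail from the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Optimizer settings for one training stage (hypothetical container)."""
    learning_rate: float
    batch_size: int
    epochs: int


@dataclass(frozen=True)
class LoRASettings:
    """LoRA settings reported in the text."""
    r: int = 16
    alpha: int = 32
    dropout: float = 0.05
    # Assumed names for the Q, K, V, and O projection modules.
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.r


SFT = StageConfig(learning_rate=1e-5, batch_size=4, epochs=3)
DPO = StageConfig(learning_rate=5e-7, batch_size=4, epochs=3)
DPO_BETA = 0.1
LORA = LoRASettings()


def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is r x d_in and B is d_out x r.
    return r * (d_in + d_out)


def dpo_pair_loss(policy_chosen: float, policy_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  beta: float = DPO_BETA) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

With these settings the LoRA update is scaled by alpha / r = 2, and a hypothetical 4096 x 4096 projection would gain 16 * (4096 + 4096) trainable parameters. When the policy and the reference model assign identical log-probabilities, the pair loss equals log 2 and falls as the policy favors the chosen response.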
\subsection{Inference Details}
For RAG hallucination detection inference, we use the same prompt templates as in training. We set temperature=0.01 and top\_p=0.1. Because the model's output is in JSON format, we parse it with regular expressions. All computation runs on a single Nvidia A100-80G GPU.
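A minimal sketch of how such a verdict could be extracted from the judge's output follows. The paper does not publish its regular expression, so the pattern and the function name `parse_verdict` are assumptions. The sketch targets the English output schema with the keys "REASONING" and "SCORE". It tries strict JSON first and falls back to a regular expression, which helps when the reasoning text contains unescaped quotation marks.

```python
import json
import re
from typing import Optional

# Hypothetical pattern; the authors' actual expression is not given.
_SCORE_RE = re.compile(r'"SCORE"\s*:\s*"(PASS|FAIL)"')


def parse_verdict(raw: str) -> Optional[str]:
    """Return "PASS" or "FAIL" from a judge response, or None if absent."""
    # Try strict JSON on the outermost {...} span first.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            obj = json.loads(raw[start:end + 1])
            score = obj.get("SCORE") if isinstance(obj, dict) else None
            if score in ("PASS", "FAIL"):
                return score
        except json.JSONDecodeError:
            pass  # Fall through to the regex, e.g. for unescaped quotes.
    match = _SCORE_RE.search(raw)
    return match.group(1) if match else None
```

A response with no recognizable verdict yields `None`, so the caller can decide whether to count it as an error or retry.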
\section{Experiment Results}
\label{app:exp}
Table \ref{tab:exp_en} and Table \ref{tab:exp_zh} present the detailed experimental results for Bi'anBench\_EN and Bi'anBench\_ZH, respectively.
\begin{table*}[]
\scalebox{0.75}{
\begin{tabular}{cccccccccc}
Model & \multicolumn{4}{c}{Bi'anBench\_EN} & & & & & \\
\cline{2-5}
 & QA & Summarization & Data-to-Text & Machine Translation & Avg. & & & & \\
\hline
GPT-4o-0806 & 86.6 & 75.5 & 85.6 & 86.4 & 84.8 & & & & \\
HaluEval\_qa & 83.8 & - & - & - & - & & & & \\
RAGTruth\_qa & 86.6 & - & - & - & - & & & & \\
FinanceBench & 86.3 & - & - & - & - & & & & \\
DROP & 86.5 & - & - & - & - & & & & \\
CovidQA & 86.6 & - & - & - & - & & & & \\
PubMedQA & 89.0 & - & - & - & - & & & & \\
ASQA & 86.3 & - & - & - & - & & & & \\
IfQA & 88.5 & - & - & - & - & & & & \\
FIB & - & 75.5 & - & - & - & & & & \\
HaluEval\_sum & - & 75.5 & - & - & - & & & & \\
WebNLG & - & - & 85.6 & - & - & & & & \\
RAGTruth\_d2t & - & - & 85.5 & - & - & & & & \\
PDC & - & - & - & 86.5 & - & & & & \\
WMT21 & - & - & - & 86.4 & - & & & & \\
\hline
GPT-4o-mini & 82.9 & 58.9 & 82.3 & 79.6 & 78.9 & & & & \\
HaluEval\_qa & 78.2 & - & - & - & - & & & & \\
RAGTruth\_qa & 84.2 & - & - & - & - & & & & \\
FinanceBench & 76.5 & - & - & - & - & & & & \\
DROP & 85.5 & - & - & - & - & & & & \\
CovidQA & 82.1 & - & - & - & - & & & & \\
PubMedQA & 84.3 & - & - & - & - & & & & \\
ASQA & 83.0 & - & - & - & - & & & & \\
IfQA & 84.4 & - & - & - & - & & & & \\
FIB & - & 59.6 & - & - & - & & & & \\
HaluEval\_sum & - & 58.3 & - & - & - & & & & \\
WebNLG & - & - & 82.3 & - & - & & & & \\
RAGTruth\_d2t & - & - & 82.3 & - & - & & & & \\
PDC & - & - & - & 80.0 & - & & & & \\
WMT21 & - & - & - & 79.2 & - & & & & \\
\hline
Llama3.1-8B-Instruct & 72.3 & 60.2 & 62.6 & 68.3 & 68.6 & & & & \\
HaluEval\_qa & 71.6 & - & - & - & - & & & & \\
RAGTruth\_qa & 73.3 & - & - & - & - & & & & \\
FinanceBench & 70.0 & - & - & - & - & & & & \\
DROP & 74.1 & - & - & - & - & & & & \\
CovidQA & 72.8 & - & - & - & - & & & & \\
PubMedQA & 72.7 & - & - & - & - & & & & \\
ASQA & 72.3 & - & - & - & - & & & & \\
IfQA & 72.3 & - & - & - & - & & & & \\
FIB & - & 60.7 & - & - & - & & & & \\
HaluEval\_sum & - & 59.7 & - & - & - & & & & \\
WebNLG & - & - & 62.6 & - & - & & & & \\
RAGTruth\_d2t & - & - & 62.6 & - & - & & & & \\
PDC & - & - & - & 67.7 & - & & & & \\
WMT21 & - & - & - & 68.9 & - & & & & \\
\hline
Llama3.1-70B-Instruct & 83.2 & 75.2 & 80.9 & 73.3 & 80.3 & & & & \\
HaluEval\_qa & 81.9 & - & - & - & - & & & & \\
RAGTruth\_qa & 85.0 & - & - & - & - & & & & \\
FinanceBench & 81.1 & - & - & - & - & & & & \\
DROP & 83.7 & - & - & - & - & & & & \\
CovidQA & 82.4 & - & - & - & - & & & & \\
PubMedQA & 83.9 & - & - & - & - & & & & \\
ASQA & 83.3 & - & - & - & - & & & & \\
IfQA & 83.6 & - & - & - & - & & & & \\
FIB & - & 75.2 & - & - & - & & & & \\
HaluEval\_sum & - & 75.2 & - & - & - & & & & \\
WebNLG & - & - & 80.9 & - & - & & & & \\
RAGTruth\_d2t & - & - & 80.9 & - & - & & & & \\
PDC & - & - & - & 73.4 & - & & & & \\
WMT21 & - & - & - & 73.2 & - & & & & \\
\hline
Qwen2-7B-Instruct & 64.2 & 56.8 & 66.4 & 74.8 & 64.9 & & & & \\
HaluEval\_qa & 63.5 & - & - & - & - & & & & \\
RAGTruth\_qa & 64.9 & - & - & - & - & & & & \\
FinanceBench & 61.2 & - & - & - & - & & & & \\
DROP & 66.3 & - & - & - & - & & & & \\
CovidQA & 62.6 & - & - & - & - & & & & \\
PubMedQA & 65.5 & - & - & - & - & & & & \\
ASQA & 64.0 & - & - & - & - & & & & \\
IfQA & 64.8 & - & - & - & - & & & & \\
FIB & - & 56.7 & - & - & - & & & & \\
HaluEval\_sum & - & 56.9 & - & - & - & & & & \\
WebNLG & - & - & 66.4 & - & - & & & & \\
RAGTruth\_d2t & - & - & 66.4 & - & - & & & & \\
PDC & - & - & - & 74.9 & - & & & & \\
WMT21 & - & - & - & 74.7 & - & & & & \\
\hline
Qwen2-72B-Instruct & 82.7 & 73.6 & 77.0 & 82.1 & 80.5 & & & & \\
HaluEval\_qa & 81.7 & - & - & - & - & & & & \\
RAGTruth\_qa & 82.9 & - & - & - & - & & & & \\
FinanceBench & 80.7 & - & - & - & - & & & & \\
DROP & 84.1 & - & - & - & - & & & & \\
CovidQA & 82.2 & - & - & - & - & & & & \\
PubMedQA & 83.3 & - & - & - & - & & & & \\
ASQA & 82.5 & - & - & - & - & & & & \\
IfQA & 83.2 & - & - & - & - & & & & \\
FIB & - & 73.7 & - & - & - & & & & \\
HaluEval\_sum & - & 73.4 & - & - & - & & & & \\
WebNLG & - & - & 77.0 & - & - & & & & \\
RAGTruth\_d2t & - & - & 77.1 & - & - & & & & \\
PDC & - & - & - & 82.6 & - & & & & \\
WMT21 & - & - & - & 81.5 & - & & & & \\
\hline
Qwen2.5-7B-Instruct & 71.6 & 66.1 & 72.8 & 80.9 & 72.3 & & & & \\
HaluEval\_qa & 71.1 & - & - & - & - & & & & \\
RAGTruth\_qa & 72.2 & - & - & - & - & & & & \\
FinanceBench & 68.7 & - & - & - & - & & & & \\
DROP & 73.0 & - & - & - & - & & & & \\
CovidQA & 70.1 & - & - & - & - & & & & \\
PubMedQA & 72.5 & - & - & - & - & & & & \\
ASQA & 71.7 & - & - & - & - & & & & \\
IfQA & 72.0 & - & - & - & - & & & & \\
FIB & - & 66.7 & - & - & - & & & & \\
HaluEval\_sum & - & 65.4 & - & - & - & & & & \\
WebNLG & - & - & 72.8 & - & - & & & & \\
RAGTruth\_d2t & - & - & 72.8 & - & - & & & & \\
PDC & - & - & - & 80.6 & - & & & & \\
WMT21 & - & - & - & 81.2 & - & & & & \\
\hline
Qwen2.5-14B-Instruct & 79.8 & 73.1 & 79.6 & 87.2 & 79.8 & & & & \\
HaluEval\_qa & 79.1 & - & - & - & - & & & & \\
RAGTruth\_qa & 80.4 & - & - & - & - & & & & \\
FinanceBench & 76.7 & - & - & - & - & & & & \\
DROP & 81.3 & - & - & - & - & & & & \\
CovidQA & 78.8 & - & - & - & - & & & & \\
PubMedQA & 80.4 & - & - & - & - & & & & \\
ASQA & 79.6 & - & - & - & - & & & & \\
IfQA & 79.5 & - & - & - & - & & & & \\
FIB & - & 73.6 & - & - & - & & & & \\
HaluEval\_sum & - & 72.5 & - & - & - & & & & \\
WebNLG & - & - & 79.6 & - & - & & & & \\
RAGTruth\_d2t & - & - & 79.6 & - & - & & & & \\
PDC & - & - & - & 86.8 & - & & & & \\
WMT21 & - & - & - & 87.6 & - & & & & \\
\hline
Qwen2.5-72B-Instruct & {\ul 85.7} & {\ul 74.7} & 78.7 & 86.6 & 83.3 & & & & \\
HaluEval\_qa & {\ul 84.9} & - & - & - & - & & & & \\
RAGTruth\_qa & {\ul 86.2} & - & - & - & - & & & & \\
FinanceBench & 83.1 & - & - & - & - & & & & \\
DROP & {\ul 86.4} & - & - & - & - & & & & \\
CovidQA & 84.7 & - & - & - & - & & & & \\
PubMedQA & {\ul 86.0} & - & - & - & - & & & & \\
ASQA & {\ul 8