DSC2025 ViHallu Challenge
- DSC2025 ViHallu Challenge is a shared task that systematically addresses hallucination detection in Vietnamese models through a curated dataset of context, prompt, and response triplets.
- The challenge uses diverse methodologies including structured prompting with factual, noisy, and adversarial inputs to stress-test model robustness and annotation protocols.
- Top systems achieved over a 50 percentage point improvement in macro-F1 score compared to baseline models, highlighting innovations in ensemble strategies and fine-tuning techniques.
The DSC2025 ViHallu Challenge is a rigorous, large-scale shared task designed to benchmark and advance hallucination detection in Vietnamese LLMs and, in the parallel multimodal track, Large Vision-Language Models (LVLMs). Hallucinations—instances where model outputs are fluent yet contradict, distort, or fabricate information relative to the input—present a critical reliability bottleneck for production deployment of language and vision-language AI. While English-centric benchmarks and methodologies for hallucination detection have matured, low-resource languages and underrepresented domains have historically lacked standardized evaluation suites and curated datasets with fine-grained annotations. The DSC2025 ViHallu Challenge addresses this gap for Vietnamese LLMs by providing the first publicly available, systematically annotated dataset of (context, prompt, response) triplets, comprehensive taxonomies for hallucination categorization, and a leaderboard-driven evaluation protocol fostering community participation and methodological innovation (Nguyen et al., 8 Jan 2026).
1. Objectives, Motivation, and Historical Context
The challenge arises from the well-documented tendency of LLMs to generate plausible-sounding but unsupported or contradictory statements, termed “hallucinations.” Although prior work in English (e.g., TruthfulQA, SemEval, FEVER) laid the foundation for basic detection protocols, Vietnamese models suffer from compounded vulnerabilities: data scarcity, limited instruction-tuning, and linguistic idiosyncrasies such as diacritic complexity. The DSC2025 ViHallu Challenge is explicitly designed to (1) establish a public, standardized benchmark for hallucination detection in Vietnamese LLMs, (2) catalyze the development and cross-comparison of mitigation techniques—including retrieval-augmented generation (RAG) loops, entailment classifiers, uncertainty estimation, and post-editing—and (3) democratize resources relevant to Vietnamese AI safety through the release of a high-quality, CC-BY-SA 4.0 dataset (Nguyen et al., 8 Jan 2026).
2. ViHallu Dataset Construction and Annotation Protocol
The ViHallu dataset comprises 10,000 annotated triplets, each denoted as (Context, Prompt, Response):
- Context (C): 1–3 sentences selected from Vietnamese Wikipedia (drawing on UIT-ViQuAD 2.0).
- Prompt (P): A question or instruction, formulated as either a factual, noisy, or adversarial challenge.
- Response (R): The output generated by a state-of-the-art, instruction-tuned LLM (GPT-4o, deterministic decoding).
Hallucinations are explicitly categorized into three disjoint classes:
- No Hallucination (“no”): Response is strictly faithful to contextual information.
- Intrinsic Hallucination (“intrinsic”): Response contradicts or distorts facts from the context.
- Extrinsic Hallucination (“extrinsic”): Response introduces information not present in the context, irrespective of its real-world correctness.
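The triplet schema and the three-way label space can be sketched as a small data model. This is an illustrative representation only; the field and class names (`ViHalluTriplet`, `HallucinationLabel`) are assumptions, not identifiers from the challenge's released code.

```python
from dataclasses import dataclass
from enum import Enum

class HallucinationLabel(Enum):
    NO = "no"                # response strictly faithful to the context
    INTRINSIC = "intrinsic"  # response contradicts or distorts the context
    EXTRINSIC = "extrinsic"  # response adds information absent from the context

@dataclass
class ViHalluTriplet:
    context: str   # 1-3 sentences from Vietnamese Wikipedia
    prompt: str    # factual, noisy, or adversarial question/instruction
    response: str  # LLM-generated answer to be classified
    label: HallucinationLabel

# Hypothetical example instance (not taken from the dataset):
sample = ViHalluTriplet(
    context="Hà Nội là thủ đô của Việt Nam.",
    prompt="Thủ đô của Việt Nam là gì?",
    response="Thủ đô của Việt Nam là Hà Nội.",
    label=HallucinationLabel.NO,
)
```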
Prompt types are designed to stress-test model robustness:
- Factual: Clean, extraction-based queries tightly coupled to the context.
- Noisy: Factual prompts with controlled noise (diacritic removal, character swaps, typos) to simulate input errors.
- Adversarial: LLM-generated prompts containing presuppositions or logic specifically constructed to induce hallucinations (Nguyen et al., 8 Jan 2026).
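The noisy prompt type relies on controlled perturbations such as diacritic removal and character swaps. A minimal sketch of two such perturbations, assuming Unicode NFD decomposition suffices for Vietnamese diacritic stripping (đ/Đ are handled explicitly because they do not decompose into a base letter plus a combining mark); the exact noise pipeline used by the organizers is not specified here.

```python
import random
import unicodedata

def remove_diacritics(text: str) -> str:
    # đ/Đ have no combining-mark decomposition, so map them first.
    text = text.replace("đ", "d").replace("Đ", "D")
    # Decompose to NFD, then drop all combining marks (the diacritics).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    # Swap one random adjacent character pair to simulate a typo.
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(remove_diacritics("Thủ đô của Việt Nam là gì?"))
# → Thu do cua Viet Nam la gi?
```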
The data is split as follows: 7,000 instances for training, 1,000 for the public test set, and 2,000 for the private test set.
3. Evaluation Metrics and Baseline System
System performance is measured primarily by the macro-F1 metric across the three hallucination categories:
Macro-F1 = (1/3) · Σ_c F1_c, where F1_c = 2 · P_c · R_c / (P_c + R_c) for c ∈ {no, intrinsic, extrinsic}.
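The macro-F1 computation over the three classes can be written directly from per-class counts (equal weight per class, regardless of class frequency):

```python
LABELS = ("no", "intrinsic", "extrinsic")

def macro_f1(gold, pred):
    # Per-class precision/recall from raw counts, averaged with equal weight.
    scores = []
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(LABELS)

gold = ["no", "intrinsic", "extrinsic", "no"]
pred = ["no", "intrinsic", "no", "no"]
print(macro_f1(gold, pred))  # → 0.6
```

In practice `sklearn.metrics.f1_score(gold, pred, average="macro")` computes the same quantity; the explicit version above makes the per-class averaging visible.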
A secondary metric, overall accuracy, is used solely for tie-breaking. The official baseline is a Vietnamese monolingual encoder model (PhoBERT) fine-tuned for three-way classification. The input is serialized as [CLS] C [SEP] P [SEP] R [SEP], with a linear classification head on the [CLS] embedding, trained with cross-entropy loss and the AdamW optimizer, without task-specific augmentation. The baseline achieves approximately 32.83% macro-F1 and ≈0.33 accuracy on both the development and private test sets (Nguyen et al., 8 Jan 2026).
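The baseline's input layout can be sketched as plain string serialization. In a real pipeline the tokenizer inserts the special tokens itself (e.g. via `tokenizer(context, ...)` in Hugging Face Transformers); they are written out literally here only to make the [CLS] C [SEP] P [SEP] R [SEP] structure explicit.

```python
def serialize_example(context: str, prompt: str, response: str) -> str:
    # Literal rendering of the baseline's input layout:
    # [CLS] C [SEP] P [SEP] R [SEP]
    return f"[CLS] {context} [SEP] {prompt} [SEP] {response} [SEP]"

text = serialize_example(
    "Hà Nội là thủ đô của Việt Nam.",
    "Thủ đô của Việt Nam là gì?",
    "Thủ đô của Việt Nam là Paris.",
)
print(text)
```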
| System | Macro-F1 (Private Test) | Architecture/Strategy |
|---|---|---|
| PhoBERT (Baseline) | 32.83% | Encoder-only, standard fine-tuning |
| HCMUS-ThangQuang | 84.80% | Qwen3-4B-Instruct, task-description+examples prompting |
| HCMUTransformer | 84.73% | Ensemble (35 LoRA adapters on Qwen3), SLSQP weighting |
| UIT_WhiteCow | 84.54% | Dual-LLM, temperature-based voting |
4. Participant Approaches, Results, and Comparative Performance
The task attracted broad participation: 155 teams registered, 136 met data-sharing prerequisites, and 111 submitted systems to the public leaderboard. Leading approaches uniformly leveraged instruction-tuned LLMs in the 4–7B parameter range, with prominent use of parameter-efficient finetuning (LoRA), carefully designed structured prompting, and ensemble strategies.
The best-performing system (HCMUS-ThangQuang) used a single Qwen3-4B-Instruct model, achieving 84.80% macro-F1 through an explicit task-description prompt and few-shot guidance. Other top systems implemented large-scale ensembles of LoRA adapters with weight optimization (HCMUTransformer) and dual-LLM voting schemes (UIT_WhiteCow). Stacked NLI-style encoders, multi-stage expert resolvers, and hierarchical voting also appear among strong submissions. All top systems far outperform the conventional encoder baseline, with absolute macro-F1 improvements above 50 percentage points. However, a substantial gap to ceiling performance remains, especially for intrinsic (contradiction-based) hallucinations (Nguyen et al., 8 Jan 2026).
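The voting-style ensembles above can be illustrated with a minimal majority vote over per-model labels. This is a simplified sketch: the fixed tie-break priority is an assumption for illustration, and UIT_WhiteCow's actual scheme weights votes by sampling temperature rather than breaking ties this way.

```python
from collections import Counter

def vote(predictions, priority=("intrinsic", "extrinsic", "no")):
    # Majority vote over per-model labels; ties broken by a fixed priority
    # order (hypothetical -- chosen here to prefer flagging hallucinations).
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    return min(tied, key=priority.index)

print(vote(["intrinsic", "no", "intrinsic"]))  # → intrinsic (clear majority)
print(vote(["no", "extrinsic"]))               # → extrinsic (tie-break)
```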
5. Analysis of Hallucination Detection: Categories, Prompt Sensitivity, and Failure Modes
The taxonomy of hallucinations adopted by the challenge facilitates nuanced analysis:
- Intrinsic hallucinations pose the greatest detection challenge: correctly identifying subtle contradictions, entity swaps, or logic reversals often requires advanced natural language inference capabilities.
- Extrinsic hallucinations are more amenable to retrieval-augmented verification, but remain a nontrivial open problem—especially under noisy or adversarial prompting.
- Prompt engineering is critical: structured, example-driven prompts substantially aid LLM deliberation and faithfulness, while ensemble methods reduce model-specific idiosyncrasies.
A plausible implication is that both prompt design and model ensembling are essential for robust performance on faithfulness-sensitive tasks. Nevertheless, annotating and detecting fine-grained contradictions in Vietnamese present ongoing challenges due to the language's inherent complexity and data scarcity (Nguyen et al., 8 Jan 2026).
6. Limitations, Open Problems, and Future Directions
Despite significant improvements over baseline architectures, hallucination detection in Vietnamese LLMs remains an unsolved problem. Particularly, intrinsic hallucinations resist current modeling and evaluation paradigms. Key open issues and proposed directions include:
- Retrieval-augmented verification loops to systematically check model outputs against contextual evidence and provenance.
- Contrastive and span-level annotation to isolate contradictory or fabricated fragments at a finer granularity.
- Confidence calibration and uncertainty estimation for safer, production-ready deployment of Vietnamese LLMs.
- Expansion to broader domains and low-resource languages: Generalizability beyond Wikipedia-based contexts and scalable methodology transfer to other Southeast Asian languages remain aspirational goals.
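The verification-loop direction in the first bullet can be caricatured with a crude lexical check that flags response tokens unsupported by the context. This is only a sketch of the idea: a real verification loop would use retrieval over evidence plus an entailment (NLI) model rather than token overlap, and the function name is hypothetical.

```python
import re

def unsupported_tokens(context: str, response: str) -> set[str]:
    # Tokens in the response that never appear in the context -- a crude
    # proxy for extrinsic content (real systems use retrieval + NLI).
    def tokenize(s: str) -> set[str]:
        return set(re.findall(r"\w+", s.lower()))
    return tokenize(response) - tokenize(context)

ctx = "Hà Nội là thủ đô của Việt Nam."
resp = "Thủ đô của Việt Nam là Paris."
print(unsupported_tokens(ctx, resp))  # → {'paris'}
```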
The dataset, infrastructure, and evaluation protocols established by the DSC2025 ViHallu Challenge provide a foundation for future work in hallucination detection, both within Vietnamese NLP and the broader low-resource AI safety landscape (Nguyen et al., 8 Jan 2026).
7. Connections to Vision-Language Hallucination Mitigation Methodologies
Complementary to the LLM-centric ViHallu challenge, recent advances in LVLM hallucination mitigation—such as the ViHallu framework for vision-centric alignment—demonstrate cross-modal relevance. Approaches relying on controlled visual variations and fine-grained question–answer supervision successfully reduce visual hallucinations, suggesting that exposure to hard negative examples (counterfactuals) and tight alignment objectives can be generalized across modalities (Dai et al., 29 Jul 2025). For DSC2025 participants, leveraging vision-centric paradigms, automated QA pipelines, and expert-voted evaluation—a methodology analogous to that adopted in multimodal settings—may offer promising directions for future data and model development.
References
- (Nguyen et al., 8 Jan 2026) DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
- (Dai et al., 29 Jul 2025) See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs