ViHallu Dataset: Benchmarking AI Hallucinations
- ViHallu Dataset is a dual-resource benchmark for detecting and mitigating hallucinations in Vietnamese LLMs and LVLMs.
- It features a rigorous annotation process with stratified sampling, ensuring balanced evaluation across the no-hallucination, intrinsic, and extrinsic classes.
- The vision-language component leverages controlled image variations with paired QA for fine-grained factuality assessment in LVLMs.
ViHallu Dataset refers to two distinct resources developed for benchmarking and mitigating hallucination phenomena in artificial intelligence systems: one for Vietnamese text-based LLMs and another for large vision-language models (LVLMs). Both resources were released under public, research-oriented initiatives and are independently documented: the ViHallu dataset for Vietnamese LLMs (Nguyen et al., 8 Jan 2026), and the ViHallu-Instruction dataset for vision-centric hallucination mitigation (Dai et al., 29 Jul 2025). Each is described below in terms of its construction, taxonomy, evaluation, and contributions to the study of hallucination in neural models.
1. Vietnamese LLM Hallucination: ViHallu Dataset
1.1. Construction and Annotation Pipeline
The ViHallu dataset was created as part of the DSC2025 ViHallu Challenge to facilitate the systematic evaluation of hallucination detection in Vietnamese LLMs (Nguyen et al., 8 Jan 2026). The source corpus was UIT-ViQuAD 2.0, containing over 35,000 Wikipedia-sourced question–answer pairs. Stratified random sampling was used to select ~10,000 passages (each 88–1,500 tokens). For each passage, three prompt variants were generated:
- Factual prompts: Template-based questions (e.g., “DNA là gì?”, i.e., “What is DNA?”).
- Noisy prompts: Lexically perturbed via token/diacritic modifications (e.g., “DNA la gi?”).
- Adversarial prompts: Synthesized by an LLM to embed misleading logic or contradictory premises.
Responses were generated using GPT-4o under deterministic decoding. A randomly selected 10% subset was human-validated for grounding and fluency; output distributions were shown to be model-agnostic through cross-model validation (GPT-4o-mini, GPT-4.1-mini).
Label annotation involved 12 Vietnamese-native NLP annotators working in batches (~970 samples/batch), assigning one of three classes: “no hallucination,” “intrinsic,” or “extrinsic.” Ambiguity was resolved via annotator confidence ratings and written justifications. Peer-review covered 10–15% of samples, with disagreements adjudicated by a senior reviewer. Automated scripts checked data integrity (UTF-8, unique IDs, strict JSON) and supervisors spot-checked 5% per batch.
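The automated integrity checks described above (UTF-8 validity, unique IDs, strict JSON) can be sketched as a small validation script. This is a minimal illustration; the field names and label strings are assumptions, not the challenge's actual schema.

```python
import json

# Assumed record schema; the real ViHallu field names may differ.
REQUIRED_FIELDS = {"id", "context", "prompt", "response", "label"}
VALID_LABELS = {"no", "intrinsic", "extrinsic"}

def check_batch(path):
    """Return a list of (line_number, error_message) for one JSONL batch."""
    seen_ids = set()
    errors = []
    # errors="strict" makes non-UTF-8 bytes raise immediately.
    with open(path, encoding="utf-8", errors="strict") as f:
        for lineno, line in enumerate(f, 1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                errors.append((lineno, "invalid JSON"))
                continue
            missing = REQUIRED_FIELDS - rec.keys()
            if missing:
                errors.append((lineno, f"missing fields: {sorted(missing)}"))
            if rec.get("id") in seen_ids:
                errors.append((lineno, "duplicate id"))
            seen_ids.add(rec.get("id"))
            if rec.get("label") not in VALID_LABELS:
                errors.append((lineno, "unknown label"))
    return errors
```

A supervisor could run this per batch before the 5% manual spot-check.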
1.2. Dataset Characteristics and Statistics
ViHallu comprises exactly 10,000 annotated triplets of (context, prompt, response) samples. Data is partitioned into train (7,000), public test (1,000), and private test (2,000) splits, maintaining near-equal (~33%) proportions for the three hallucination classes to prevent label skew.
| Split | No Hallucination | Intrinsic | Extrinsic |
|---|---|---|---|
| Train (7,000) | 2,245 | 2,448 | 2,307 |
| Public Test | 334 | 344 | 322 |
| Private Test | 690 | 672 | 638 |
Each context yields one prompt of each type (factual, noisy, adversarial), with ~3,333 samples per prompt type. Passages average ~180 tokens, prompts ~27 tokens, and responses ~40 tokens. Token distributions are consistent and overlap closely across splits, as confirmed by kernel-density estimation.
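The class-balanced partitioning above can be sketched as a label-stratified shuffle-and-split. The 70/10/20 fractions mirror the train/public/private sizes; the record layout is an assumption for illustration.

```python
import random
from collections import defaultdict

def stratified_split(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split records with a 'label' key into three label-balanced parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)
    splits = [[], [], []]
    for items in by_label.values():
        rng.shuffle(items)
        n = len(items)
        a = int(fractions[0] * n)          # end of train slice
        b = a + int(fractions[1] * n)      # end of public-test slice
        splits[0] += items[:a]
        splits[1] += items[a:b]
        splits[2] += items[b:]
    return splits
```

Stratifying per label keeps each split near the ~33% class proportions regardless of the overall sample count.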
1.3. Formal Label Taxonomy
Given context $c$, prompt $p$, and response $r$, labels are defined as:
- No hallucination: $r$ is fully entailed by $c$ (i.e., $c \models r$)
- Intrinsic: $r$ directly contradicts $c$; formally, $r$ contains a statement $s$ such that $c \models \neg s$
- Extrinsic: $r$ makes claims not supported by $c$; i.e., $r$ contains $s$ with $c \not\models s$ and $c \not\models \neg s$
These distinctions align the dataset with prior taxonomies in faithfulness, mapping “intrinsic” cases to natural language inference (contradiction detection) and “extrinsic” to grounding.
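One illustrative reading of this taxonomy as a decision rule over per-claim NLI judgments (a sketch of the mapping, not the official annotation procedure):

```python
def hallucination_label(entails, contradicts):
    """Map per-claim NLI decisions to the three ViHallu classes.

    entails / contradicts: parallel lists of booleans, one per claim in the
    response, indicating whether the context entails / contradicts that claim.
    """
    if any(contradicts):
        return "intrinsic"   # some claim is contradicted by the context
    if all(entails):
        return "no"          # every claim is entailed by the context
    return "extrinsic"       # unsupported but not contradicted claims remain
```

Contradiction takes precedence over missing support, matching the intrinsic/extrinsic distinction above.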
1.4. Evaluation Metrics and Results
Macro-F1 over the three classes is the principal ranking metric, with accuracy as a secondary measure. Letting $K = \{\text{no}, \text{intrinsic}, \text{extrinsic}\}$, Macro-F1 is computed as $\mathrm{MacroF1} = \frac{1}{|K|} \sum_{k \in K} \mathrm{F1}_k$, where $\mathrm{F1}_k = \frac{2 P_k R_k}{P_k + R_k}$, with standard per-class precision $P_k$ and recall $R_k$.
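The ranking metric can be computed with a plain stdlib sketch, equivalent to scikit-learn's `f1_score(average="macro")`:

```python
def macro_f1(y_true, y_pred, classes=("no", "intrinsic", "extrinsic")):
    """Unweighted mean of per-class F1 over the three hallucination classes."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because the mean is unweighted, a system that ignores one class caps its score at roughly two-thirds even on balanced data.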
The baseline PhoBERT encoder-only model (trained as a three-way classifier) achieved Macro-F1 ≈ 30%. Leading systems, notably a Qwen3-4B-Instruct model with LoRA and structured prompting, achieved Macro-F1 = 84.80% on the private test set. Methods employing instruction-tuned LLMs, LoRA adapters, and voting/ensemble strategies dominated the leaderboard; the top seven systems all exceeded 84% Macro-F1, a gain of more than 50 points over the baseline.
1.5. Methodological Insights and Limitations
Performance remains notably lower for intrinsic hallucinations, attributed to the need for fine-grained semantic (contradiction) inference, in contrast to the more accessible retrieval-style checks for extrinsic hallucinations. Common error patterns include subtle entity swaps and negation errors. The challenge highlights future priorities such as:
- Developing contrastive objectives tailored to contradiction versus entailment detection
- Confidence calibration for robust deployment
- Span-level annotation for hallucination localization
- Cross-lingual extensions to under-represented languages in Southeast Asia
The dataset is distributed under CC-BY-SA 4.0 and serves as a foundation for trustworthiness benchmarks for Vietnamese models (Nguyen et al., 8 Jan 2026).
2. Vision-Language Hallucination Mitigation: ViHallu-Instruction Dataset
2.1. Motivation and Design Principles
The ViHallu-Instruction dataset addresses hallucinations in large vision-language models (LVLMs), where responses may depart from the image evidence due to misalignment between the visual and textual modalities (Dai et al., 29 Jul 2025). The resource provides paired “original” and “visual-variation” images with associated QA pairs, specifically constructed to force fine-grained visual–semantic grounding and expose hallucination risks.
Dataset construction emphasizes:
- Generation of visual variation images with controlled, localized changes (e.g., “brown horse” → “chestnut mare”) while retaining global scene layout
- Inclusion of counterfactual edits (e.g., placing "zebra" in a desert) to disrupt correlational biases
- Focusing on object existence, attribute, and relation-based hallucination categories
2.2. Composition and Data Fields
The dataset comprises:
- 1,719 original images, sampled from COCO, GQA, A-OKVQA
- 5,051 high-quality visual variations, retained from 7,209 initially generated (~4.2 per original) after filtering by VQAScore ≥ 0.6
- 6,770 total images
- ~50,000 QA pairs (mean 7.4 per image)
QA pairs are annotated by category (object, attribute, relation) and expert vote count. No prescribed train/validation/test splits exist; researchers use the full image set for instruction tuning, or stratify by image type or hallucination category for ablation.
| Data Field | Description |
|---|---|
| image_id | Unique identifier for each image |
| variation_flag | {"original", "variation"} |
| segmentation_mask | Path to mask for manipulated region |
| caption_orig/var | Text descriptions before and after edit |
| question, answer | QA pairs targeting factuality |
| hallucination_cat. | {"object", "attribute", "relation"} |
| expert_vote_count | Number of affirmative validation responses |
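Given the fields above, category-wise ablation subsets can be sketched by bucketing JSONL records. The field name `hallucination_cat` is an assumption based on the table and should be adjusted to the actual schema.

```python
import json
from collections import defaultdict

# Assumed field name, following the data-field table; verify against the
# released JSONL files before use.
CATEGORY_FIELD = "hallucination_cat"

def group_by_category(jsonl_lines):
    """Bucket QA instruction records by hallucination category."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        groups[rec[CATEGORY_FIELD]].append(rec)
    return groups
```

The same pattern applies to stratifying by `variation_flag` for original-vs-variation ablations.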
2.3. Visual Variation Generation Pipeline
- Segmentation mask extraction: MobileSAM used to isolate regions for editing.
- Base captioning: Tag2Text generates an initial description (C_orig).
- LLM-based caption editing: DeepSeek-chat V2 produces edited caption variants (C_var) by swapping tokens in C_orig.
- Text-to-image synthesis: ControlNet++ conditions diffusion-based synthesis of the variation image (x_var) on the mask and the edited captions.
- Filtering: Images are kept where VQAScore(x_var, C_var) ≥ 0.6 as computed by LLaVA-1.5-13B.
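The filtering stage can be sketched as a simple threshold gate; the `vqa_score` callable stands in for the LLaVA-1.5-13B-based scorer, which is not reimplemented here.

```python
# Threshold from the pipeline description above.
VQASCORE_THRESHOLD = 0.6

def filter_variations(candidates, vqa_score):
    """Keep (image, edited_caption) pairs whose VQAScore clears the threshold.

    candidates: iterable of (image, caption) pairs.
    vqa_score: callable scoring image-caption alignment in [0, 1].
    """
    return [
        (img, cap) for img, cap in candidates
        if vqa_score(img, cap) >= VQASCORE_THRESHOLD
    ]
```

With this gate, the 7,209 generated candidates reduce to the 5,051 retained variations reported above.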
2.4. QA Instruction Construction and Validation
Each image receives a set of LVLM-generated descriptions and object tags (via Grounded-SAM). Questions are generated (DeepSeek-chat V2) following existence, attribute, counting, and relational templates. Answers are grounded using InternVL2.5-38B, purposefully ignoring hallucinated content in prior descriptions.
QA validation employs an ensemble of LLaVA-1.5, MiniCPM-V2.6, and mPLUG-OWL3 to filter out ambiguous pairs via majority voting on a binary ("yes/no") factuality prompt; ~20% samples are filtered at this stage.
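The ensemble validation step can be sketched as a majority vote over binary factuality judgments; the validator callables stand in for LLaVA-1.5, MiniCPM-V2.6, and mPLUG-OWL3, which are not implemented here.

```python
def majority_keep(qa_pair, validators):
    """Keep a QA pair only if a strict majority of validators answers "yes".

    validators: callables that return "yes" or "no" for a factuality prompt
    over the QA pair (model inference is abstracted away in this sketch).
    """
    votes = sum(1 for v in validators if v(qa_pair) == "yes")
    return votes > len(validators) // 2
```

With three validators, a 2-of-3 agreement is required, which is what filters out the ~20% of ambiguous pairs.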
2.5. Benchmarks, Evaluation, and Usage
ViHallu-Instruction is used to instruction-tune LVLMs such as LLaVA-1.5 (7B), MiniGPT-4 v2, and Qwen2-VL (7B). Its impact is measured across several evaluation suites:
- POPE: Object hallucination probing (accuracy gains of +1.2–5.2 pp, F1 gains of +1.3–1.6 pp)
- LLaVA-Bench: Improvements of 8.2–14.4 pp across tasks (conversation, detail, complex reasoning)
- MMHal-Bench: Score increases of 9.3–23.9%; hallucination-rate reductions of 5.7–8.2%
2.6. Relation to Other Datasets
ViHallu-Instruction distinguishes itself by:
- Providing aligned image edits + matched QA (vision-centric), not just text negatives or synthetic captions as in LLaVA, ALLaVA, InstructBLIP
- Supporting fine-grained attribute/relation hallucination, in contrast to existing datasets that focus on coarse captioning or bounding boxes
- Including counterfactual samples to debias model priors
Dataset files (images, masks, captions, and instructions in JSONL) and code are publicly available, with recommendations for use involving Python 3.8+, PyTorch ≥ 2.0, and sufficient GPU resources (≥1 A100) (Dai et al., 29 Jul 2025).
3. Significance and Impact
ViHallu (text-based and vision-based) resources represent first-of-kind, large-scale, and systematically validated benchmarks for hallucination detection in Vietnamese LLMs and for hallucination mitigation in LVLMs, respectively (Nguyen et al., 8 Jan 2026, Dai et al., 29 Jul 2025). The release of these datasets has catalyzed substantial methodological progress in instruction tuning, ensemble modeling, and data-centric evaluation for robust factuality.
For Vietnamese LLMs, the dataset establishes the task as a three-class classification problem underpinned by formal entailment and contradiction criteria, enabling the quantification and targeted improvement of model faithfulness. The vision-centric resource operationalizes fine-grained visual reasoning challenges, supporting both object- and relation-level grounding, and addresses domain-specific hallucination tendencies not readily exposed by existing text-centric datasets.
4. Challenges and Open Directions
Both ViHallu datasets expose persistent challenges:
- Intrinsic hallucination detection in text models requires nuanced contradiction inference beyond retrieval or simple grounding, with errors often traced to semantic subtleties (e.g., entity swaps, negation).
- Vision-language hallucination mitigation remains limited by the need for models to generalize beyond common co-occurrence priors and attend to subtle scene details.
- Annotator consistency for fine-grained judgment, especially in multilingual or low-resource contexts, remains critical for reliability.
- Extensions to span-level annotation, confidence calibration, and expansion to other Southeast Asian languages or multimodal domains are active areas for future work.
5. Data Access and Licensing
- ViHallu (text): Publicly released under CC-BY-SA 4.0 by the DSC2025 ViHallu Challenge (Nguyen et al., 8 Jan 2026).
- ViHallu-Instruction (vision): Available via https://github.com/oliviadzy/ViHallu, with structured image directories, JSONL-formatted QA instructions, and segmentation mask resources (Dai et al., 29 Jul 2025).
By supporting systematic development and evaluation of hallucination-robust multilingual LLMs and LVLMs, these resources establish a rigorous foundation for future research in factuality, grounding, and trustworthiness.