ViHallu Dataset: Benchmarking AI Hallucinations
- ViHallu Dataset is a dual-resource benchmark for detecting and mitigating hallucinations in Vietnamese LLMs and LVLMs.
- It features a rigorous annotation process with stratified sampling, ensuring balanced evaluation across the no-hallucination, intrinsic, and extrinsic classes.
- The vision-language component leverages controlled image variations with paired QA for fine-grained factuality assessment in LVLMs.
ViHallu Dataset refers to two distinct resources developed for benchmarking and mitigating hallucination phenomena in artificial intelligence systems: one for Vietnamese text-based LLMs and another for large vision-language models (LVLMs). Both resources were released under public, research-oriented initiatives and are independently documented: the ViHallu dataset for Vietnamese LLMs (Nguyen et al., 8 Jan 2026), and the ViHallu-Instruction dataset for vision-centric hallucination mitigation (Dai et al., 29 Jul 2025). Each is described below in terms of its construction, taxonomy, evaluation, and contributions to the study of hallucination in neural models.
1. Vietnamese LLM Hallucination: ViHallu Dataset
1.1. Construction and Annotation Pipeline
The ViHallu dataset was created as part of the DSC2025 ViHallu Challenge to facilitate the systematic evaluation of hallucination detection in Vietnamese LLMs (Nguyen et al., 8 Jan 2026). The source corpus was UIT-ViQuAD 2.0, containing over 35,000 Wikipedia-sourced question–answer pairs. Stratified random sampling was used to select ~10,000 passages (each 88–1,500 tokens). For each passage, three prompt variants were generated:
- Factual prompts: Template-based questions (e.g., “DNA là gì?”, i.e., “What is DNA?”).
- Noisy prompts: Lexically perturbed via token/diacritic modifications (e.g., “DNA la gi?”).
- Adversarial prompts: Synthesized by an LLM to embed misleading logic or contradictory premises.
Responses were generated using GPT-4o under deterministic decoding. A randomly selected 10% subset was human-validated for grounding and fluency; output distributions were shown to be model-agnostic through cross-model validation (GPT-4o-mini, GPT-4.1-mini).
Label annotation involved 12 Vietnamese-native NLP annotators working in batches (~970 samples/batch), assigning one of three classes: “no hallucination,” “intrinsic,” or “extrinsic.” Ambiguity was resolved via annotator confidence ratings and written justifications. Peer-review covered 10–15% of samples, with disagreements adjudicated by a senior reviewer. Automated scripts checked data integrity (UTF-8, unique IDs, strict JSON) and supervisors spot-checked 5% per batch.
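The automated integrity checks described above (UTF-8 validity, unique IDs, strict JSON) can be sketched as a small validation script. This is a minimal illustration; the field names and label strings are assumptions, not the challenge's actual schema.

```python
import json

# Assumed record schema; the real ViHallu field names may differ.
REQUIRED_FIELDS = {"id", "context", "prompt", "response", "label"}
VALID_LABELS = {"no", "intrinsic", "extrinsic"}

def check_batch(path):
    """Return a list of (line_number, error_message) for one JSONL batch."""
    seen_ids = set()
    errors = []
    # errors="strict" makes non-UTF-8 bytes raise immediately.
    with open(path, encoding="utf-8", errors="strict") as f:
        for lineno, line in enumerate(f, 1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                errors.append((lineno, "invalid JSON"))
                continue
            missing = REQUIRED_FIELDS - rec.keys()
            if missing:
                errors.append((lineno, f"missing fields: {sorted(missing)}"))
            if rec.get("id") in seen_ids:
                errors.append((lineno, "duplicate id"))
            seen_ids.add(rec.get("id"))
            if rec.get("label") not in VALID_LABELS:
                errors.append((lineno, "unknown label"))
    return errors
```

A supervisor could run this per batch before the 5% manual spot-check.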
1.2. Dataset Characteristics and Statistics
ViHallu comprises exactly 10,000 annotated triplets of (context, prompt, response) samples. Data is partitioned into train (7,000), public test (1,000), and private test (2,000) splits, maintaining near-equal (~33%) proportions for the three hallucination classes to prevent label skew.
| Split | No Hallucination | Intrinsic | Extrinsic |
|---|---|---|---|
| Train (7,000) | 2,245 | 2,448 | 2,307 |
| Public Test | 334 | 344 | 322 |
| Private Test | 690 | 672 | 638 |
Each context yields one prompt of each type (factual, noisy, adversarial), with ~3,333 samples per prompt type. Passages average ~180 tokens, prompts ~27 tokens, and responses ~40 tokens. Token distributions are consistent and overlap closely across splits, as confirmed by kernel-density estimation.
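The class-balanced partitioning above can be sketched as a label-stratified shuffle-and-split. The 70/10/20 fractions mirror the train/public/private sizes; the record layout is an assumption for illustration.

```python
import random
from collections import defaultdict

def stratified_split(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split records with a 'label' key into three label-balanced parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)
    splits = [[], [], []]
    for items in by_label.values():
        rng.shuffle(items)
        n = len(items)
        a = int(fractions[0] * n)          # end of train slice
        b = a + int(fractions[1] * n)      # end of public-test slice
        splits[0] += items[:a]
        splits[1] += items[a:b]
        splits[2] += items[b:]
    return splits
```

Stratifying per label keeps each split near the ~33% class proportions regardless of the overall sample count.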
1.3. Formal Label Taxonomy
Given context $c$, prompt $p$, and response $r$, labels are defined as:
- No hallucination: $r$ is fully entailed by $c$ (i.e., $c \models r$)
- Intrinsic: $r$ directly contradicts $c$; formally, $r$ contains a statement $s$ such that $c \models \neg s$
- Extrinsic: $r$ makes claims not supported by $c$; i.e., $r$ contains $s$ with $c \not\models s$ and $c \not\models \neg s$
These distinctions align the dataset with prior taxonomies in faithfulness, mapping “intrinsic” cases to natural language inference (contradiction detection) and “extrinsic” to grounding.
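One illustrative reading of this taxonomy as a decision rule over per-claim NLI judgments (a sketch of the mapping, not the official annotation procedure):

```python
def hallucination_label(entails, contradicts):
    """Map per-claim NLI decisions to the three ViHallu classes.

    entails / contradicts: parallel lists of booleans, one per claim in the
    response, indicating whether the context entails / contradicts that claim.
    """
    if any(contradicts):
        return "intrinsic"   # some claim is contradicted by the context
    if all(entails):
        return "no"          # every claim is entailed by the context
    return "extrinsic"       # unsupported but not contradicted claims remain
```

Contradiction takes precedence over missing support, matching the intrinsic/extrinsic distinction above.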
1.4. Evaluation Metrics and Results
Macro-F1 over the three classes is the principal ranking metric, with accuracy as a secondary measure. Letting $K = \{\text{no}, \text{intrinsic}, \text{extrinsic}\}$, Macro-F1 is computed as $\mathrm{MacroF1} = \frac{1}{|K|} \sum_{k \in K} \mathrm{F1}_k$, where $\mathrm{F1}_k = \frac{2 P_k R_k}{P_k + R_k}$, with standard per-class precision $P_k$ and recall $R_k$.
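The ranking metric can be computed with a plain stdlib sketch, equivalent to scikit-learn's `f1_score(average="macro")`:

```python
def macro_f1(y_true, y_pred, classes=("no", "intrinsic", "extrinsic")):
    """Unweighted mean of per-class F1 over the three hallucination classes."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because the mean is unweighted, a system that ignores one class caps its score at roughly two-thirds even on balanced data.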
The baseline PhoBERT encoder-only model (trained as a three-way classifier) achieved Macro-F1 ≈ 30%. Leading systems, notably a Qwen3-4B-Instruct model with LoRA and structured prompting, achieved Macro-F1 = 84.80% on the private test set. Methods employing instruction-tuned LLMs, LoRA adapters, and voting/ensemble strategies dominated the leaderboard; the top seven systems all exceeded 84% Macro-F1, a gain of more than 50 points over the baseline.
1.5. Methodological Insights and Limitations
Performance remains notably lower for intrinsic hallucinations, attributed to the need for fine-grained semantic (contradiction) inference, in contrast to the more accessible retrieval-style checks for extrinsic hallucinations. Common error patterns include subtle entity swaps and negation errors. The challenge highlights future priorities such as:
- Developing contrastive objectives tailored to contradiction versus entailment detection
- Confidence calibration for robust deployment
- Span-level annotation for hallucination localization
- Cross-lingual extensions to under-represented languages in Southeast Asia
The dataset is distributed under CC-BY-SA 4.0 and serves as a foundation for trustworthiness benchmarks for Vietnamese models (Nguyen et al., 8 Jan 2026).
2. Vision-Language Hallucination Mitigation: ViHallu-Instruction Dataset
2.1. Motivation and Design Principles
The ViHallu-Instruction dataset addresses hallucinations in large vision-language models (LVLMs), where responses may depart from the image evidence due to misalignment between the visual and textual modalities (Dai et al., 29 Jul 2025). The resource provides paired “original” and “visual-variation” images with associated QA pairs, specifically constructed to force fine-grained visual–semantic grounding and expose hallucination risks.
Dataset construction emphasizes:
- Generation of visual variation images with controlled, localized changes (e.g., “brown horse” → “chestnut mare”) while retaining global scene layout
- Inclusion of counterfactual edits (e.g., placing "zebra" in a desert) to disrupt correlational biases
- Focusing on object existence, attribute, and relation-based hallucination categories
2.2. Composition and Data Fields
The dataset comprises:
- 1,719 original images, sampled from COCO, GQA, A-OKVQA
- 5,051 high-quality visual variations, retained from 7,209 initially generated (~4.2 per original) after filtering by VQAScore ≥ 0.6
- 6,770 total images
- ~50,000 QA pairs (mean 7.4 per image)
QA pairs are annotated by category (object, attribute, relation) and expert vote count. No prescribed train/validation/test splits exist; researchers use the full image set for instruction tuning, or stratify by image type or hallucination category for ablation.
| Data Field | Description |
|---|---|
| image_id | Unique identifier for each image |
| variation_flag | {"original", "variation"} |
| segmentation_mask | Path to mask for manipulated region |
| caption_orig/var | Text descriptions before and after edit |
| question, answer | QA pairs targeting factuality |
| hallucination_cat. | {"object", "attribute", "relation"} |
| expert_vote_count | Number of affirmative validation responses |
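Given the fields above, category-wise ablation subsets can be sketched by bucketing JSONL records. The field name `hallucination_cat` is an assumption based on the table and should be adjusted to the actual schema.

```python
import json
from collections import defaultdict

# Assumed field name, following the data-field table; verify against the
# released JSONL files before use.
CATEGORY_FIELD = "hallucination_cat"

def group_by_category(jsonl_lines):
    """Bucket QA instruction records by hallucination category."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        groups[rec[CATEGORY_FIELD]].append(rec)
    return groups
```

The same pattern applies to stratifying by `variation_flag` for original-vs-variation ablations.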
2.3. Visual Variation Generation Pipeline
- Segmentation mask extraction: MobileSAM used to isolate regions for editing.
- Base captioning: Tag2Text generates an initial description (C_orig).
- LLM-based caption editing: DeepSeek-chat V2 produces edited caption variants (C_var) by swapping tokens in C_orig.
- Text-to-image synthesis: ControlNet++ conditions diffusion-based synthesis of the variation image (x_var) on the mask and the edited captions.
- Filtering: Images are kept where VQAScore(x_var, C_var) ≥ 0.6 as computed by LLaVA-1.5-13B.
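The filtering stage can be sketched as a simple threshold gate; the `vqa_score` callable stands in for the LLaVA-1.5-13B-based scorer, which is not reimplemented here.

```python
# Threshold from the pipeline description above.
VQASCORE_THRESHOLD = 0.6

def filter_variations(candidates, vqa_score):
    """Keep (image, edited_caption) pairs whose VQAScore clears the threshold.

    candidates: iterable of (image, caption) pairs.
    vqa_score: callable scoring image-caption alignment in [0, 1].
    """
    return [
        (img, cap) for img, cap in candidates
        if vqa_score(img, cap) >= VQASCORE_THRESHOLD
    ]
```

With this gate, the 7,209 generated candidates reduce to the 5,051 retained variations reported above.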
2.4. QA Instruction Construction and Validation
Each image receives a set of LVLM-generated descriptions and object tags (via Grounded-SAM). Questions are generated (DeepSeek-chat V2) following existence, attribute, counting, and relational templates. Answers are grounded using InternVL2.5-38B, purposefully ignoring hallucinated content in prior descriptions.
QA validation employs an ensemble of LLaVA-1.5, MiniCPM-V2.6, and mPLUG-OWL3 to filter out ambiguous pairs via majority voting on a binary ("yes/no") factuality prompt; ~20% samples are filtered at this stage.
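The ensemble validation step can be sketched as a majority vote over binary factuality judgments; the validator callables stand in for LLaVA-1.5, MiniCPM-V2.6, and mPLUG-OWL3, which are not implemented here.

```python
def majority_keep(qa_pair, validators):
    """Keep a QA pair only if a strict majority of validators answers "yes".

    validators: callables that return "yes" or "no" for a factuality prompt
    over the QA pair (model inference is abstracted away in this sketch).
    """
    votes = sum(1 for v in validators if v(qa_pair) == "yes")
    return votes > len(validators) // 2
```

With three validators, a 2-of-3 agreement is required, which is what filters out the ~20% of ambiguous pairs.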
2.5. Benchmarks, Evaluation, and Usage
ViHallu-Instruction is used to instruction-tune LVLMs such as LLaVA-1.5 (7B), MiniGPT-4 v2, and Qwen2-VL (7B). Its impact is measured across several evaluation suites:
- POPE: Object hallucination probing (accuracy gains of +1.2–5.2 pp, F1 gains of +1.3–1.6 pp)
- LLaVA-Bench: Improvements of 8.2–14.4 pp across tasks (conversation, detail, complex reasoning)
- MMHal-Bench: Score increases of 9.3–23.9%; hallucination-rate reductions of 5.7–8.2%
2.6. Relation to Other Datasets
ViHallu-Instruction distinguishes itself by:
- Providing aligned image edits + matched QA (vision-centric), not just text negatives or synthetic captions as in LLaVA, ALLaVA, InstructBLIP
- Supporting fine-grained attribute/relation hallucination, in contrast to existing datasets that focus on coarse captioning or bounding boxes
- Including counterfactual samples to debias model priors
Dataset files (images, masks, captions, and instructions in JSONL) and code are publicly available, with recommendations for use involving Python 3.8+, PyTorch ≥ 2.0, and sufficient GPU resources (≥1 A100) (Dai et al., 29 Jul 2025).
3. Significance and Impact
ViHallu (text-based and vision-based) resources represent first-of-kind, large-scale, and systematically validated benchmarks for hallucination detection in Vietnamese LLMs and for hallucination mitigation in LVLMs, respectively (Nguyen et al., 8 Jan 2026, Dai et al., 29 Jul 2025). The release of these datasets has catalyzed substantial methodological progress in instruction tuning, ensemble modeling, and data-centric evaluation for robust factuality.
For Vietnamese LLMs, the dataset establishes the task as a three-class classification problem underpinned by formal entailment and contradiction criteria, enabling the quantification and targeted improvement of model faithfulness. The vision-centric resource operationalizes fine-grained visual reasoning challenges, supporting both object- and relation-level grounding, and addresses domain-specific hallucination tendencies not readily exposed by existing text-centric datasets.
4. Challenges and Open Directions
Both ViHallu datasets expose persistent challenges:
- Intrinsic hallucination detection in text models requires nuanced contradiction inference beyond retrieval or simple grounding, with errors often traced to semantic subtleties (e.g., entity swaps, negation).
- Vision-language hallucination mitigation remains limited by the need for models to generalize beyond common co-occurrence priors and attend to subtle scene details.
- Annotator consistency for fine-grained judgment, especially in multilingual or low-resource contexts, remains critical for reliability.
- Extensions to span-level annotation, confidence calibration, and expansion to other Southeast Asian languages or multimodal domains are active areas for future work.
5. Data Access and Licensing
- ViHallu (text): Publicly released under CC-BY-SA 4.0 by the DSC2025 ViHallu Challenge (Nguyen et al., 8 Jan 2026).
- ViHallu-Instruction (vision): Available via https://github.com/oliviadzy/ViHallu, with structured image directories, JSONL-formatted QA instructions, and segmentation mask resources (Dai et al., 29 Jul 2025).
By supporting systematic development and evaluation of hallucination-robust multilingual LLMs and LVLMs, these resources establish a rigorous foundation for future research in factuality, grounding, and trustworthiness.