EgoAVU-Bench: Egocentric Audio-Visual Benchmark
- EgoAVU-Bench is a benchmark for egocentric audio-visual understanding featuring 3,000 QA pairs from 900 videos to rigorously assess multi-modal reasoning.
- It evaluates both open- and closed-ended tasks—such as sound-source association, temporal reasoning, and detailed narration—to ensure comprehensive model testing.
- Performance evaluations reveal a vision bias in current MLLMs, with fine-tuning notably improving audio grounding and integrated perception.
EgoAVU-Bench is a manually verified evaluation benchmark designed for rigorous assessment of egocentric audio-visual understanding by multi-modal LLMs (MLLMs). Built as part of the EgoAVU pipeline, EgoAVU-Bench systematically tests joint reasoning over audio and visual modalities, with special emphasis on grounded, temporally aligned, and multimodal queries that span open- and closed-ended task formats. The suite includes 3,000 question–answer (QA) pairs over 900 egocentric videos. Existing leading MLLMs, when evaluated on EgoAVU-Bench, consistently demonstrate a vision-biased failure mode, struggling especially with audio grounding and sound–source association. Fine-tuning on the companion EgoAVU-Instruct dataset yields substantial performance gains, highlighting the benchmark's role in catalyzing progress on joint egocentric perception and reasoning (Seth et al., 5 Feb 2026).
1. Dataset Structure and Task Taxonomy
EgoAVU-Bench targets held-out, test-only evaluation with no training or validation splits. It comprises:
- 900 egocentric videos (1–6 minutes long, ~4 minutes on average; frames at 256×256 resolution, sampled at 1 FPS; audio sampling rate inherited from Ego4D, 16–48 kHz) and a test suite of 3,000 manually curated QA items.
- Narrations used during construction were sourced from the larger EgoAVU-Instruct set (9,900 training videos, ~3 million samples), but no video overlaps between Instruct and the Bench test set.
Task coverage is articulated across five categories:
- Open-Ended Tasks
- Source–Sound Association (SSA): Identify each salient foreground or background sound and map it to its visible source (object or action).
- Audio–Visual Segment Narration (AVSN): For a specified interval (e.g., 240–250 sec), compose a coherent, multimodal narration detailing what transpired, including perceptual and action-event content.
- Audio–Visual Dense Narration (AVDN): Densely summarize an entire video, integrating timestamped actions, objects, and sounds.
- Closed-Ended Tasks
- Temporal Reasoning (TR): Multiple-choice queries on the ordering of multimodal events ("before/after," "which occurred first/last").
- Audio–Visual Hallucination (AVH): Binary (Yes/No) questions designed to probe for spurious or hallucinatory audio, object, or action detection.
This taxonomy is designed to probe model understanding of cross-modal event correlation, temporal contextualization, and hallucination suppression.
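To make the taxonomy concrete, a minimal sketch of how a single benchmark item might be represented in code follows; the class and field names (`QAItem`, `segment`, etc.) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAItem:
    """Hypothetical container for one EgoAVU-Bench QA pair."""
    video_id: str                     # source egocentric clip
    task: str                         # "SSA" | "AVSN" | "AVDN" | "TR" | "AVH"
    question: str
    answer: str                       # free text, chosen option, or Yes/No
    options: Optional[list] = None    # present only for TR multiple choice
    segment: Optional[tuple] = None   # (start_s, end_s) for AVSN intervals

def is_closed_ended(item: QAItem) -> bool:
    # TR and AVH are scored by exact-match accuracy; the open-ended tasks
    # (SSA, AVSN, AVDN) use LLM-as-judge plus METEOR/ROUGE-L.
    return item.task in {"TR", "AVH"}

item = QAItem("ego_0001", "AVH", "Is a blender audible in this clip?", "No")
print(is_closed_ended(item))  # True
```

The open/closed split matters because it routes each item to a different scoring path (see Section 3).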
2. Data Generation and Curation Pipeline
EgoAVU-Bench QA pairs are curated via a modular four-stage pipeline:
- Narration Enhancement
- Uni-modal captioning: Qwen2.5-VL produces center-frame object lists; Qwen2.5-Omni produces both action-centric video captions (visual-only) and foreground/background sound descriptions (audio-only).
- Token-Based Filtering for Lexical Diversity
- Moving-Average Type–Token Ratio (MATTR): For each narration with tokens $t_1,\dots,t_n$ and window size $W$, $\mathrm{MATTR} = \frac{1}{n-W+1}\sum_{i=1}^{n-W+1}\frac{|\{t_i,\dots,t_{i+W-1}\}|}{W}$, i.e., the mean type–token ratio over all sliding windows.
- Only the top 75% of videos by MATTR are retained, to enforce narrative variety.
- Graph-Based Multi-Modal Context Curation (MCG)
- LLaMA-70B parses all uni-modal narrations into structured JSON (interacted objects, background objects, sound attributes: acoustic description, source, evidence, foreground/background) to form a Multi-Modal Context Graph.
- Use of MCG reduces caption errors dramatically (41% → 10.5%, a –76.1% relative drop), essential for robust fact- and modality-grounded QA.
- QA Synthesis
- LLaMA-70B is prompted with fused narration+MCG content to synthesize SSA, AVSN, AVDN, TR, AVH QA pairs, via templates ensuring a balanced distribution across modalities and reasoning types.
Subsequently, all 3,000 QA pairs undergo manual inspection (by trained annotators; ~225 person-hours), with 50.8% requiring substantive edits—principally in open-ended sound tag fidelity, distractor plausibility/challenge for closed-ended tasks, and enforcing strict temporal alignment.
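The MATTR filtering stage above admits a compact implementation. The sketch below follows the standard MATTR definition; the window size of 50 is an illustrative assumption, not the paper's reported setting.

```python
from statistics import mean

def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: the mean type/token ratio over a
    sliding window of fixed size (falls back to plain TTR for short texts)."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    return mean(len(set(tokens[i:i + window])) / window
                for i in range(len(tokens) - window + 1))

def filter_top_fraction(narrations, keep=0.75):
    """Retain the top `keep` fraction of narrations ranked by MATTR,
    mirroring the pipeline's top-75% lexical-diversity cut."""
    ranked = sorted(narrations, key=lambda n: mattr(n.split()), reverse=True)
    return ranked[:max(1, int(len(ranked) * keep))]
```

Repetitive narrations ("a a a a …") score near 1/W and are cut first, which is exactly the redundancy the pipeline targets.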
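The Multi-Modal Context Graph stage parses uni-modal narrations into structured JSON. The entry below is an invented example whose keys follow the fields named in the pipeline description (interacted/background objects; sound attributes with acoustic description, source, evidence, foreground/background); the exact key names and values are assumptions.

```python
import json

# Illustrative shape of one MCG entry; contents are fabricated for
# demonstration and do not come from the dataset.
mcg_entry = {
    "interacted_objects": ["knife", "cutting board"],
    "background_objects": ["sink", "radio"],
    "sounds": [
        {
            "acoustic_description": "rhythmic chopping",
            "source": "knife on cutting board",
            "evidence": "camera wearer slicing vegetables",
            "placement": "foreground",
        },
        {
            "acoustic_description": "faint music",
            "source": "radio",
            "evidence": "radio visible on counter",
            "placement": "background",
        },
    ],
}
print(json.dumps(mcg_entry, indent=2))
```

Structuring sounds with an explicit `source` and `evidence` field is what lets the QA-synthesis stage generate grounded SSA questions instead of hallucinated pairings.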
3. Evaluation Protocols and Scoring Metrics
Task-specific scoring metrics in EgoAVU-Bench are:
- Closed-Ended Tasks (TR, AVH):
- Accuracy, computed as $\text{Accuracy} = \frac{\#\,\text{correct answers}}{\#\,\text{questions}}$.
- Open-Ended Tasks (SSA, AVSN, AVDN):
- LLM-as-Judge Score (S): Integer from 1–5 measuring factuality/alignment (~87.6% human–LLM agreement).
- METEOR (M): Weighted harmonic mean of unigram precision $P$ and recall $R$ with a fragmentation penalty: $F_{\text{mean}} = \frac{10PR}{R + 9P}$, $\text{Pen} = 0.5\left(\frac{\#\,\text{chunks}}{\#\,\text{matched unigrams}}\right)^{3}$, $M = F_{\text{mean}}(1 - \text{Pen})$.
- ROUGE-L (R): Based on the longest common subsequence (LCS) between hypothesis $H$ and reference $R$: $P_{\text{lcs}} = \frac{\mathrm{LCS}(H,R)}{|H|}$, $R_{\text{lcs}} = \frac{\mathrm{LCS}(H,R)}{|R|}$, $F_{\text{lcs}} = \frac{(1+\beta^{2})\,R_{\text{lcs}}\,P_{\text{lcs}}}{R_{\text{lcs}} + \beta^{2}\,P_{\text{lcs}}}$.
Optional metrics for future expansion:
- Precision, recall, F1, and IoU for grounding tasks.
This suite enables nuanced, multi-granular evaluation spanning both holistic summary and fine-grained compositional reasoning.
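As a concrete reference, the closed-ended accuracy and the ROUGE-L score can be sketched as below. These follow the standard metric definitions, not the benchmark's official scoring code; the $\beta = 1.2$ default is an assumption.

```python
def accuracy(preds, golds):
    """Closed-ended (TR/AVH) scoring: fraction of exact matches."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def rouge_l(hyp, ref, beta=1.2):
    """ROUGE-L F-score from the longest common subsequence (LCS) of two
    token lists; beta weights recall over precision."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # LCS length table
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if hyp[i] == ref[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

Because ROUGE-L rewards in-order token overlap rather than contiguous n-grams, it complements the LLM-as-judge score for the long-form AVSN/AVDN narrations.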
4. Baseline Model Performance and Error Profile
A comparative evaluation of contemporary open-source MLLMs reveals consistent underperformance in joint audio-visual reasoning when tested on EgoAVU-Bench. Model results are summarized as follows:
| Model | Size | SSA S | AVDN (S/M/R) | AVSN (S/M/R) | TR Acc. | AVH Acc. |
|---|---|---|---|---|---|---|
| VideoLLaMA2 | 7B | 1.51 | 1.88 / 3.65 / 8.32 | 1.71 / 7.50 / 13.89 | 37.00% | 20.32% |
| Baichuan-Omni | 7B | 1.49 | 1.92 / 4.82 / 9.75 | 1.79 / 8.34 / 14.21 | 39.85% | 21.10% |
| Intern-Omni | 8.7B | 1.47 | 1.95 / 5.20 / 10.11 | 1.82 / 8.69 / 14.70 | 41.22% | 21.75% |
| Phi4-mm | 8B | 1.42 | 1.59 / 8.79 / 13.13 | 1.69 / 12.17 / 16.90 | 45.04% | 22.89% |
| MiniCPM-o | 8B | 1.43 | 2.27 / 10.84 / 14.77 | 2.06 / 9.68 / 12.19 | 26.44% | 21.76% |
| Qwen2.5-Omni | 3B | 1.45 | 2.17 / 8.55 / 13.45 | 1.85 / 8.63 / 13.08 | 46.40% | 26.28% |
| Qwen2.5-Omni | 7B | 1.50 | 2.37 / 10.69 / 14.74 | 1.99 / 9.99 / 13.39 | 53.20% | 42.69% |
Key observations:
- All models score below 1.6 out of 5 on SSA and under 43% accuracy on AVH.
- Even the best-performing model, Qwen2.5-Omni-7B, reaches only S = 2.37 (AVDN) and 1.99 (AVSN), 53.2% on TR, and 42.7% on AVH.
- Error distribution shows a pronounced visual bias: object recognition accuracy > action recognition > sound localization and association.
5. Effects of Fine-Tuning and Component Ablation
Fine-tuning on EgoAVU-Instruct using Qwen2.5-Omni-7B leads to marked performance improvements, summarized below:
| Setting | SSA S | AVDN (S/M/R) | AVSN (S/M/R) | TR Acc. | AVH Acc. |
|---|---|---|---|---|---|
| Base (7B) | 1.50 | 2.37 / 10.69 / 14.74 | 1.99 / 9.99 / 13.39 | 53.20% | 42.69% |
| +LoRA on Instruct | 3.15 | 2.60 / 12.20 / 17.19 | 2.45 / 22.53 / 28.34 | 64.31% | 61.69% |
| +Full Fine-tuning | 3.20 | 2.66 / 12.50 / 17.32 | 2.63 / 22.68 / 28.70 | 67.84% | 60.12% |
| Relative Δ (%) | +113 | +12.2 / +16.9 / +17.2 | +27.6 / +86.5 / +69.8 | +27.2% | +30.8% |
- SSA (sound–source association) improves by 113%.
- Substantial gains on temporal reasoning and hallucination avoidance (TR +27.2%, AVH +30.8%).
Transfer across benchmarks demonstrates similar patterns: up to +28% Δ on EgoTempo, +7.2% on EgoIllusion, with performance on exocentric (non-egocentric) benchmarks such as VideoMME and AVQA largely unchanged.
Ablating the MCG stage (replacing graph-based context with direct fusion of uni-modal narrations) raises caption error rates from 10.5% to 41.0%—especially for sound–source and action–event alignment—establishing MCG's necessity for robust, non-hallucinatory narration generation.
6. Context, Implications, and Limitations
EgoAVU-Bench exposes a fundamental limitation in contemporary MLLMs: systematic neglect of audio cues and weak correlation between auditory events and their visual sources in egocentric settings. The benchmark, together with its construction pipeline, provides an experimental protocol to quantify and address these issues. Benchmarks employing single-modality or exocentric data do not reveal this failure mode.
A plausible implication is that datasets and evaluation tools emphasizing multi-modality are critical not only for progress in egocentric understanding, but also for robust generalization to downstream embodied intelligence, robotics, and real-world agent applications.
Limitations include the manual-curation bottleneck that constrains scaling of QA quality, the difficulty of balancing open- and closed-ended QA mixtures, and the current absence of grounding metrics (precision, recall, F1, IoU), which are earmarked for future expansion.
7. Benchmark Impact and Relevance
EgoAVU-Bench constitutes a comprehensive, high-fidelity target for benchmarking joint audio-visual reasoning in egocentric video. Its design and evaluation metrics facilitate both fine-grained and holistic model diagnostics—a critical step toward MLLMs capable of genuine embodied intelligence.
The benchmark has catalyzed the development of new automated data curation methods (MCG, MATTR-based filtering), and has revealed the concrete benefits of large-scale multimodal pretraining and fine-tuning for MLLMs. By facilitating the identification of vision–audio integration gaps, EgoAVU-Bench will likely shape future research on compositionality, cross-modal reasoning, and robust, grounded perception in AI systems (Seth et al., 5 Feb 2026).