NExT-QA: Causal & Temporal VideoQA
- NExT-QA is a VideoQA benchmark that emphasizes explicit causal and temporal reasoning using richly annotated, real-world video clips.
- The dataset comprises 5,440 videos with phased annotations for causal, temporal, and descriptive questions across multi-choice and open-ended formats.
- Performance evaluations reveal a significant gap between current automated models and human reasoning, highlighting the need for advanced multimodal fusion techniques.
NExT-QA is a video question answering (VideoQA) benchmark engineered to catalyze a transition from basic visual scene description to explicit reasoning about causal and temporal dynamics within videos. Diverging from predecessors whose primary focus was static object/action recognition or simple counting, NExT-QA foregrounds the inference of causes, intentions, and the temporal ordering of events by means of rigorously annotated data and carefully constructed question types. The benchmark serves both multi-choice and open-ended QA formulations, directly evaluating models’ ability to invoke deeper reasoning about real-world interactions, and exposes systematic weaknesses in current VideoQA architectures with respect to causal and temporal inference (Xiao et al., 2021, Guda et al., 2024).
1. Motivations and Scope
Existing VideoQA datasets such as MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA primarily test pattern-matching on shallow scene description (“what is the man doing?” or “how many times did the bird flap its wings?”), which can often be solved by visual or textual feature retrieval. NExT-QA is instead designed to rigorously probe causal (“why is the dog barking?”) and temporal reasoning (“what did she do before picking up the phone?”), reflecting real-world scenarios where understanding hinges on inferring chains of cause and effect, and the order of multi-agent activities. The central research questions include: Can current models move beyond superficial recognition to explain visible causes? How robustly can they reason about previous, current, or next actions given complex interactions? Do models generalize to free-form generation absent fixed answer choices?
2. Dataset Construction and Annotation Protocol
NExT-QA samples 5,440 real-world videos from the VidOR dataset, selected for their richness in object and action interactions. The videos average 44 seconds and depict daily-life situations including family, pet, and outdoor scenes. The dataset is partitioned into 3,870 training, 570 validation, and 1,000 test videos (approximately 7:1:2). Annotation is undertaken by 100 trained workers in three staged phases (causal → temporal → descriptive), with strict quality controls: question/answer lengths capped (≤22/≤6 words), balanced answer types (yes/no and count answers ≤20), and staged separation of questioners and answerers to reduce bias.
Content Distribution
- Total Q-A pairs: 52,044 open-ended, 47,692 multi-choice (five options each)
- Question Type Breakdown: Causal (48%), Temporal (29%), Descriptive (23%)
- Question/Answer Lengths: µ±σ = 11.6±4.8 (Q), 2.6±1.4 (A); causal/temporal questions longer (~13 words) than descriptive (~8 words)
- Example QA pairs:
- Causal: “Why is the toddler crying?” → “He fell on the floor.”
- Temporal: “What did the mother do after the child dropped the toy?” → “She picked it up.”
- Descriptive: “What is the bicycle leaning against?” → “Against the fence.” (Xiao et al., 2021)
3. Task Definitions and Evaluation Metrics
Multi-Choice QA
Given a video $v$, question $q$, and five candidate answers $\{a_1, \dots, a_5\}$, models select $a^* = \arg\max_i f(v, q, a_i)$, where $f$ is the scoring function. Training deploys a hinge ranking loss:

$$\mathcal{L} = \sum_{a^- \neq a^+} \max\bigl(0,\ m + f(v, q, a^-) - f(v, q, a^+)\bigr),$$

where $a^+$ is the correct answer, $a^-$ are the distractors, and $m$ is the margin. The main evaluation metric is accuracy (%).
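A minimal sketch of this hinge ranking loss, assuming the scoring function has already been reduced to one scalar per candidate (the variable names and margin value are illustrative, not from the benchmark's code):

```python
# Hinge ranking loss over the five multi-choice candidates: penalize any
# distractor whose score comes within `margin` of the correct answer's score.
def hinge_ranking_loss(scores, correct_idx, margin=1.0):
    """Sum of max(0, margin - s_pos + s_neg) over the distractors."""
    s_pos = scores[correct_idx]
    loss = 0.0
    for i, s_neg in enumerate(scores):
        if i == correct_idx:
            continue  # skip the positive candidate itself
        loss += max(0.0, margin - s_pos + s_neg)
    return loss
```

For example, with candidate scores `[2.0, 0.5, 1.5, 0.2, -0.3]` and the first answer correct, only the third candidate (score 1.5) violates the unit margin, contributing 0.5 to the loss.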
Open-Ended QA (Generation)
The model encodes $(v, q)$ and decodes an answer sequence $a = (w_1, \dots, w_T)$, maximizing

$$p(a \mid v, q) = \prod_{t=1}^{T} p(w_t \mid w_{<t}, v, q).$$

The training objective is cross-entropy; generation is scored by WUPS, a thresholded aggregation of Wu–Palmer WordNet similarity between predicted and ground-truth answer tokens:

$$\mathrm{WUPS}(A, T) = \min\Bigl(\prod_{a \in A} \max_{t \in T} \mathrm{WuP}(a, t),\ \prod_{t \in T} \max_{a \in A} \mathrm{WuP}(a, t)\Bigr)$$
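The WUPS aggregation can be sketched as follows, with the word-level similarity passed in as a callable (Wu–Palmer over WordNet in the benchmark; a simple exact-match stand-in below) and the common 0.9-threshold variant that scales sub-threshold similarities by 0.1:

```python
# WUPS-style set aggregation: each token in one set is matched to its most
# similar token in the other set, matches are multiplied, and the two
# directions (prediction->truth, truth->prediction) are combined with min.
def wups(answer_tokens, truth_tokens, sim, threshold=0.9):
    def thresholded(a, b):
        s = sim(a, b)
        return s if s >= threshold else 0.1 * s  # down-weight weak matches

    def directed(src, dst):
        score = 1.0
        for a in src:
            score *= max(thresholded(a, b) for b in dst)
        return score

    return min(directed(answer_tokens, truth_tokens),
               directed(truth_tokens, answer_tokens))

exact = lambda a, b: 1.0 if a == b else 0.0  # stand-in for WordNet Wu-Palmer
```

With the exact-match stand-in, `wups(["the", "fence"], ["the", "fence"], exact)` is 1.0, while any unmatched token drives the score to 0; a WordNet-backed similarity gives partial credit for near-synonyms instead.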
4. Benchmark Models and Methodological Advances
Baseline Architectures
- Video Features: Each video sampled into 16 clips of 16 frames; appearance via ResNet-101 (ImageNet), motion via 3D ResNeXt-101 (Kinetics)
- Language Features: GloVe (300d), BERT-base (768d, fine-tuned for multi-choice)
- BlindQA (text-only) achieves ~44% accuracy; SOTA methods (HGA + BERT-FT) reach ~49.7%.
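A toy sketch of how such baseline features might be fused into a single candidate score; all dimensions, the mean-pooling, and the bilinear fusion are illustrative assumptions rather than any specific model's exact architecture:

```python
import numpy as np

# Random stand-ins for extracted features: per-clip appearance (ResNet-101
# style) and motion (3D ResNeXt-101 style) vectors, plus a BERT-style
# question-answer embedding.
rng = np.random.default_rng(0)
n_clips, d_app, d_mot, d_txt = 16, 2048, 2048, 768
appearance = rng.standard_normal((n_clips, d_app))
motion = rng.standard_normal((n_clips, d_mot))
qa_embedding = rng.standard_normal(d_txt)

# Mean-pool clips over time and concatenate the two streams into one
# video vector, then score the (video, question-answer) pair with a
# learned bilinear form W.
video = np.concatenate([appearance.mean(0), motion.mean(0)])  # shape (4096,)
W = rng.standard_normal((video.shape[0], d_txt)) * 0.01
score = video @ W @ qa_embedding  # scalar f(v, q, a)
```

In a multi-choice model, this scalar would be computed for each of the five candidates and fed to the hinge ranking loss.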
State-of-the-Art Models
- HGA: Heterogeneous graph alignment reasoning, top performer on NExT-QA
- STVQA, EVQA, CoMem/HME, PSAC, HCRN: Diverse attention and memory network architectures with multi-modal fusion
- Pairwise Cross Modal Aggregation (PCMA) and Multimodal Action Grounding (MAG/MAR): Recent advances employing temporal encoding, action description fusion, and attention mechanisms (Guda et al., 2024)
| Model | Acc_C (%) | Acc_T (%) | Acc_D (%) | Overall (%) |
|---|---|---|---|---|
| ATP | 39.32 | 44.23 | 45.17 | 41.81 |
| PCMA | 44.38 | 42.99 | 61.65 | 46.27 |
| PCMA+MAR-32 | 44.80 | 45.78 | 63.96 | 48.10 |
| EIGV | 51.29 | 53.11 | 62.78 | 53.74 |
| EIGV+MAR-32 | 52.64 | 52.58 | 64.63 | 54.59 |
| EIGV+F2+MAR-32 | 53.09 | 53.78 | 62.56 | 54.86 |
| Human | 87.6 | 88.6 | 90.4 | ~90 |
Recent empirical work substantiates the performance gains of methods that intelligently sample and encode key video segments and captions, such as PCMA (+4.5 pp over ATP) and EIGV+F2+MAR-32 achieving 54.86% (state-of-the-art) (Guda et al., 2024).
5. Causal and Temporal Reasoning Methodologies
Action and Event Grounding
- Multimodal Action Grounding (MAG/MAR): Combines pretrained action recognition (TimeSformer on Kinetics-400), SwinBERT video captioning, and Moment DETR-based timestamping to select the top-N salient clips. The final features are encoded by CLIP, fusing “ACTION [SEP] DESCRIPTION [SEP] QUESTION [SEP] ANSWER” into a single BERT vector (Guda et al., 2024).
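The fused text input described above can be sketched as a simple string join; the helper name and example values below are illustrative:

```python
# Build the MAG-style fused input: action label, generated caption, question,
# and candidate answer joined with [SEP] before text encoding.
def build_mag_input(action, description, question, answer, sep=" [SEP] "):
    return sep.join([action, description, question, answer])

fused = build_mag_input(
    "picking up",                        # e.g. an action-recognition label
    "a woman picks a toy off the rug",   # e.g. a video-captioning output
    "What did the mother do after the child dropped the toy?",
    "She picked it up.",
)
```

The single fused string is then encoded as one sequence, letting the text encoder attend jointly across action, caption, question, and answer tokens.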
Causal Interventions
- EIGV: Enforces invariant/equivariant answers under perturbations to non-causal/causal frames via anchor-positive-negative triplet contrastive learning.
- Multimodal Robust Intervener (MRI) + MNSE: Augments intervention robustness via FAISS-based mining of nearest-neighbor scenes, improving answer stability under “do(.)” interventions (e.g., EIGV+MNSE: performance drop under unseen interventions is reduced by ~2 pp) (Guda et al., 2024).
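The anchor-positive-negative constraint behind this style of contrastive training can be sketched as a standard triplet margin loss; the distance function, margin, and vectors here are illustrative:

```python
import math

# Triplet objective in the spirit of EIGV: the answer representation should
# stay close when non-causal frames are perturbed (positive) and move away
# when causal frames are perturbed (negative).
def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

When the positive sits much closer to the anchor than the negative does, the loss is zero; otherwise the gradient pushes the representations apart until the margin is satisfied.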
Frame Sampling
- Smart Sampling (S3): Teacher–student networks score and select salient frames, reinforced by policies that trade off VQA loss and computational cost. Empirical gains are modest, limited by computational demands.
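Score-based frame subsampling of this kind can be sketched as follows; in S3 the saliency scores come from a learned teacher network, whereas here they are supplied directly:

```python
# Keep only the k most salient frames, preserving temporal order so the
# student model still sees events in sequence.
def select_salient_frames(scores, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

frames = select_salient_frames([0.1, 0.9, 0.3, 0.8, 0.2], k=2)  # → [1, 3]
```

Downstream, only the selected frames are encoded, trading a small accuracy risk for a large reduction in per-video compute.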
6. Quantitative Results and Failure Analysis
Systematic evaluation reveals a persistent gap of roughly 35 pp in causal and temporal QA between state-of-the-art models and humans (SOTA: 54.86%; human: ~90%), with performance especially weak on longer or open-ended questions. Descriptive questions are routinely solved, while causal/temporal tasks are degraded by plausible but incorrect inferences and misordering of events (e.g., “He is hungry” vs. ground truth “He fell”) (Xiao et al., 2021, Guda et al., 2024). Models often exploit distractor elimination in multi-choice QA rather than authentic reasoning, and generalization collapses when forced to answer without options.
7. Challenges and Prospective Research Directions
Key obstacles highlighted by NExT-QA include enforcing consistency across multiple temporally linked QA pairs, achieving systematic generalization to novel object-action combinations, and integrating external commonsense knowledge without undermining grounding in visible causes. Graph-based reasoning models (e.g., HGA), end-to-end cross-modal fusion, and leveraging VidOR’s object–relation annotations for explicit causal/temporal graph supervision are identified as promising avenues. Further, learned subsampling and dynamic fusion of modality representations remain underexplored for scalable, robust reasoning (Xiao et al., 2021, Guda et al., 2024).
A plausible implication is that benchmarks such as NExT-QA will continue to shape future research in VideoQA by compelling architectures to transition from high-performance feature aggregation to true causal and temporal reasoning, thereby narrowing the substantial gap between automated and human video understanding.