NExT-QA: Causal & Temporal VideoQA

Updated 23 January 2026
  • NExT-QA is a VideoQA benchmark that emphasizes explicit causal and temporal reasoning using richly annotated, real-world video clips.
  • The dataset comprises 5,440 videos with phased annotations for causal, temporal, and descriptive questions across multi-choice and open-ended formats.
  • Performance evaluations reveal a significant gap between current automated models and human reasoning, highlighting the need for advanced multimodal fusion techniques.

NExT-QA is a video question answering (VideoQA) benchmark engineered to catalyze a transition from basic visual scene description to explicit reasoning about causal and temporal dynamics within videos. Diverging from predecessors whose primary focus was on static object/action recognition or simple counting, NExT-QA foregrounds the inference of causes, intentions, and the temporal ordering of events by means of rigorously annotated datasets and carefully constructed question types. The benchmark serves both multi-choice and open-ended QA formulations, directly evaluating models’ ability to invoke deeper reasoning about real-world interactions, and exposes systematic weaknesses in current VideoQA architectures with respect to causal and temporal inference (Xiao et al., 2021, Guda et al., 2024).

1. Motivations and Scope

Existing VideoQA datasets such as MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA primarily test pattern-matching on shallow scene description (“what is the man doing?” or “how many times did the bird flap its wings?”), which can often be solved by visual or textual feature retrieval. NExT-QA is instead designed to rigorously probe causal (“why is the dog barking?”) and temporal reasoning (“what did she do before picking up the phone?”), reflecting real-world scenarios where understanding hinges on inferring chains of cause and effect, and the order of multi-agent activities. The central research questions include: Can current models move beyond superficial recognition to explain visible causes? How robustly can they reason about previous, current, or next actions given complex interactions? Do models generalize to free-form generation absent fixed answer choices?

2. Dataset Construction and Annotation Protocol

NExT-QA samples 5,440 real-world videos from the VidOR dataset, selected for their richness in object and action interactions. The videos average 44 seconds and depict daily-life situations including family, pet, and outdoor scenes. The dataset is partitioned into 3,870 training, 570 validation, and 1,000 test videos (approximately 7:1:2). Annotation is undertaken by 100 trained workers in three staged phases (causal → temporal → descriptive), with strict quality controls: question/answer lengths capped (≤22/≤6 words), balanced answer types (yes/no and count answers ≤20), and staged separation of questioners and answerers to reduce bias.

Content Distribution

  • Total Q-A pairs: 52,044 open-ended, 47,692 multi-choice (five options each)
  • Question Type Breakdown: Causal (48%), Temporal (29%), Descriptive (23%)
  • Question/Answer Lengths: µ±σ = 11.6±4.8 (Q), 2.6±1.4 (A); causal/temporal questions longer (~13 words) than descriptive (~8 words)
  • Example QA pairs:
    • Causal: “Why is the toddler crying?” → “He fell on the floor.”
    • Temporal: “What did the mother do after the child dropped the toy?” → “She picked it up.”
    • Descriptive: “Where is the bicycle leaning against?” → “Against the fence.” (Xiao et al., 2021)

3. Task Definitions and Evaluation Metrics

Multi-Choice QA

Given a video $v$, question $q$, and five candidate answers $\{a_i\}_{i=1}^{5}$, models select $\hat{a} = \arg\max_i s(v, q, a_i)$, where $s(\cdot)$ is the scoring function. Training deploys a hinge ranking loss:

$$\mathcal{L} = \sum_i \max\big(0,\ \Delta - s(v,q,a^+) + s(v,q,a_i^-)\big),$$

where $a^+$ is the correct answer, $a_i^-$ are the distractors, and $\Delta$ is the margin. The main evaluation metric is accuracy (%).
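The selection rule and ranking loss above can be sketched in plain Python. The candidate scores here are toy numbers standing in for a learned multimodal scorer $s(v,q,a)$; the function names are illustrative, not from the benchmark codebase.

```python
def select_answer(scores):
    """Multi-choice QA: return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])

def hinge_ranking_loss(pos_score, neg_scores, margin=1.0):
    """Sum over distractors of max(0, margin - s(v,q,a+) + s(v,q,a_i-))."""
    return sum(max(0.0, margin - pos_score + s_neg) for s_neg in neg_scores)

# Toy scores for five candidates; index 2 is the ground-truth answer.
scores = [0.1, 0.4, 0.9, 0.3, 0.2]
pred = select_answer(scores)                                   # -> 2
loss = hinge_ranking_loss(scores[2], scores[:2] + scores[3:])  # -> ~1.4
```

The loss drives the correct answer's score at least `margin` above every distractor; once that holds, every hinge term is zero.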

Open-Ended QA (Generation)

The model encodes $(v, q)$ and decodes an answer sequence $a_1 \ldots a_T$, maximizing

$$P(a \mid v, q) = \prod_{t=1}^{T} P(a_t \mid a_{<t}, v, q).$$
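This factorization is straightforward to evaluate given per-step token distributions from the decoder. The distributions below are hypothetical toy values, not from any real model; in practice each one is conditioned on the video, question, and previously generated tokens.

```python
import math

def sequence_log_prob(step_dists, answer_tokens):
    """log P(a|v,q) = sum_t log P(a_t | a_<t, v, q), per the factorization above."""
    return sum(math.log(dist[tok]) for dist, tok in zip(step_dists, answer_tokens))

# Hypothetical per-step decoder distributions for a two-token answer.
dists = [{"he": 0.7, "she": 0.3}, {"fell": 0.6, "ran": 0.4}]
logp = sequence_log_prob(dists, ["he", "fell"])   # log(0.7) + log(0.6)
```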

The training objective is cross-entropy; generation is scored by WUPS, a soft accuracy measure based on Wu–Palmer (WUP) WordNet similarity rather than exact string match:

$$\mathrm{WUPS}(P, R) = 100 \times \min\Big\{ \prod_{p \in P} \max_{r \in R} \mathrm{WUP}(p, r),\ \prod_{r \in R} \max_{p \in P} \mathrm{WUP}(r, p) \Big\}.$$

(Xiao et al., 2021)
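The WUPS aggregation can be sketched independently of the underlying word similarity. Here `wup` is a pluggable function (in practice a WordNet-based Wu–Palmer similarity, e.g. from NLTK); the trivial exact-match similarity below is purely for illustration.

```python
def wups(pred_tokens, ref_tokens, wup):
    """WUPS(P, R): 100 * min over both directions of the product of
    best-match WUP scores, per the formula above."""
    def directed(a, b):
        prod = 1.0
        for p in a:
            prod *= max(wup(p, r) for r in b)
        return prod
    return 100.0 * min(directed(pred_tokens, ref_tokens),
                       directed(ref_tokens, pred_tokens))

# Toy similarity: 1.0 for identical tokens, 0.5 otherwise (WordNet stand-in).
toy_wup = lambda p, r: 1.0 if p == r else 0.5

score = wups(["he", "fell"], ["he", "fell", "down"], toy_wup)   # -> 50.0
```

The `min` over both directions penalizes the prediction for both missing reference words and extraneous predicted words.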

4. Benchmark Models and Methodological Advances

Baseline Architectures

  • Video Features: Videos sampled as 16 clips of 16 frames each; appearance via ResNet-101 (ImageNet-pretrained), motion via 3D ResNeXt-101 (Kinetics-pretrained)
  • Language Features: GloVe (300d), BERT-base (768d, fine-tuned for multi-choice)
  • BlindQA (text-only) achieves ~44% accuracy; SOTA methods (HGA + BERT-FT) reach ~49.7%.

State-of-the-Art Models

  • HGA: Heterogeneous graph alignment reasoning, top performer on NExT-QA
  • STVQA, EVQA, CoMem/HME, PSAC, HCRN: Diverse attention and memory network architectures with multi-modal fusion
  • Pairwise Cross Modal Aggregation (PCMA) and Multimodal Action Grounding (MAG/MAR): Recent advances employing temporal encoding, action description fusion, and attention mechanisms (Guda et al., 2024)

| Model | Acc_C (%) | Acc_T (%) | Acc_D (%) | Overall (%) |
|---|---|---|---|---|
| ATP | 39.32 | 44.23 | 45.17 | 41.81 |
| PCMA | 44.38 | 42.99 | 61.65 | 46.27 |
| PCMA+MAR-32 | 44.80 | 45.78 | 63.96 | 48.10 |
| EIGV | 51.29 | 53.11 | 62.78 | 53.74 |
| EIGV+MAR-32 | 52.64 | 52.58 | 64.63 | 54.59 |
| EIGV+F2+MAR-32 | 53.09 | 53.78 | 62.56 | 54.86 |
| Human | 87.6 | 88.6 | 90.4 | ~90 |

Recent empirical work substantiates the gains of methods that intelligently sample and encode key video segments and captions: PCMA improves on ATP by roughly 4.5 pp overall, and EIGV+F2+MAR-32 reaches 54.86%, the strongest result in this comparison (Guda et al., 2024).

5. Causal and Temporal Reasoning Methodologies

Action and Event Grounding

  • Multimodal Action Grounding (MAG/MAR): Combines pretrained action recognition (TimeSformer on Kinetics-400), SwinBERT video captioning, and Moment DETR-based timestamping to select the top-N salient clips. The final features are encoded by CLIP, fusing “ACTION [SEP] DESCRIPTION [SEP] QUESTION [SEP] ANSWER” into a single BERT vector (Guda et al., 2024).
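The fused input sequence described above can be sketched as simple string construction before tokenization; the function name and example strings below are illustrative assumptions, not from the MAG/MAR codebase.

```python
def build_mag_input(action, description, question, answer, sep="[SEP]"):
    """Join grounded action label, clip caption, question, and candidate
    answer into the single sequence fed to the text encoder."""
    return f" {sep} ".join([action, description, question, answer])

seq = build_mag_input(
    "picking up",
    "a woman picks up a toy from the floor",
    "what did the mother do after the child dropped the toy?",
    "she picked it up",
)
# "picking up [SEP] a woman picks up a toy ... [SEP] she picked it up"
```

Encoding the concatenated sequence lets the text encoder attend jointly across the action label, caption, question, and each candidate answer when producing the fused representation.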

Causal Interventions

  • EIGV: Enforces invariant/equivariant answers under perturbations to non-causal/causal frames via anchor-positive-negative triplet contrastive learning.
  • Multimodal Robust Intervener (MRI) + MNSE: Augments intervention robustness via FAISS-based mining of nearest-neighbor scenes, improving answer stability under “do(.)” interventions (e.g., EIGV+MNSE: performance drop under unseen interventions is reduced by ~2 pp) (Guda et al., 2024).
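The anchor-positive-negative constraint behind EIGV can be illustrated with a minimal triplet margin loss on toy embeddings; this is a generic sketch of the contrastive objective, not EIGV's actual implementation, and all names and values are assumptions.

```python
def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull the non-causal perturbation (positive) toward the anchor and
    push the causal perturbation (negative) at least `margin` farther away."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Toy clip embeddings: perturbing non-causal frames barely moves the
# representation; perturbing causal frames moves it far.
anchor, pos, neg = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]
loss = triplet_loss(anchor, pos, neg, margin=0.5)   # -> 0.0 (constraint met)
```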

Frame Sampling

  • Smart Sampling (S3): Teacher–student networks score and select salient frames, reinforced by policies that trade off VQA loss and computational cost. Empirical gains are modest, limited by computational demands.
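The core selection step of saliency-driven sampling can be sketched as a top-N filter over per-frame scores; the scores here are hypothetical teacher-network outputs, and the function name is illustrative.

```python
def select_salient_frames(frame_scores, n):
    """Keep the indices of the n highest-scoring frames, returned in
    temporal order so downstream encoders see an ordered clip."""
    top = sorted(range(len(frame_scores)),
                 key=lambda i: frame_scores[i], reverse=True)[:n]
    return sorted(top)

# Hypothetical saliency scores from a teacher network over 8 frames.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.4]
kept = select_salient_frames(scores, n=3)   # -> [1, 3, 5]
```

Re-sorting the selected indices preserves event order, which matters for the temporal questions the benchmark targets.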

6. Quantitative Results and Failure Analysis

Systematic evaluation reveals a persistent 35–40 pp gap in causal and temporal QA between state-of-the-art models and humans (SOTA: ~54.86%; human: ~90%), with performance especially weak on longer or open-ended questions. Descriptive questions are routinely solved, while causal/temporal tasks are degraded by plausible but incorrect inferences and misordering of events (e.g., “He is hungry” vs. ground truth “He fell”) (Xiao et al., 2021, Guda et al., 2024). Models often exploit distractor elimination in multi-choice QA rather than authentic reasoning, and generalization collapses when forced to answer without options.

7. Challenges and Prospective Research Directions

Key obstacles highlighted by NExT-QA include enforcing consistency across multiple temporally linked QA pairs, achieving systematic generalization to novel object-action combinations, and integrating external commonsense knowledge without undermining grounding in visible causes. Graph-based reasoning models (e.g., HGA), end-to-end cross-modal fusion, and leveraging VidOR’s object–relation annotations for explicit causal/temporal graph supervision are identified as promising avenues. Further, learned subsampling and dynamic fusion of modality representations remain underexplored for scalable, robust reasoning (Xiao et al., 2021, Guda et al., 2024).

A plausible implication is that benchmarks such as NExT-QA will continue to shape future research in VideoQA by compelling architectures to transition from high-performance feature aggregation to true causal and temporal reasoning, thereby narrowing the substantial gap between automated and human video understanding.
