
DanmakuTPP-QA: Multimodal TPP Benchmark

Updated 8 February 2026
  • The paper introduces a benchmark designed to evaluate integrated temporal, textual, and visual reasoning over Danmaku event streams.
  • A multi-agent pipeline leveraging LLMs and vision models facilitates precise annotation, quality control, and adaptive sequence compression.
  • Baseline models reveal that while larger LLMs boost temporal prediction accuracy, effective multimodal fusion remains a critical challenge.

DanmakuTPP-QA is a large-scale, multi-modal question answering benchmark specifically constructed to evaluate temporal, textual, and visual reasoning over event sequences derived from “Danmaku” (bullet comment) streams on video platforms. This benchmark is designed to facilitate the development and analysis of models capable of integrated reasoning across timestamped events, associated commentary, and synchronized video frames—tasks at the intersection of temporal point process (TPP) modeling and multi-modal language modeling (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).

1. Dataset Motivation and Overview

DanmakuTPP-QA was created to address limitations in prior TPP datasets, which were predominantly unimodal and failed to represent the multimodal event streams encountered on streaming platforms. Real-world Danmaku streams display asynchronous, high-frequency event structures: each event comprises a timestamp $t_i$, a free-form text comment $m_i$, and a coincident video frame $V_i$. Existing TPP models were restricted to temporal and occasionally textual information, neglecting the complex dependencies introduced by visual context. DanmakuTPP-QA directly challenges models to integrate temporal, textual, and visual signals for both predictive and generative reasoning (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).

2. Construction Pipeline and Data Characteristics

The benchmark is derived from the DanmakuTPP-Events corpus, which consists of user-generated bullet comments with precise timestamps and aligned video data. DanmakuTPP-QA comprises 2,605 curated video sessions (each with 500–1,500 events, average 967 per video). A multi-agent pipeline, leveraging state-of-the-art LLMs and MLLMs, was developed for dataset creation:

  • Task-Design Agent: Analyzes raw TPP sequences, designs 10 distinct tasks (8 closed-ended, 2 open-ended), and specifies I/O formats.
  • Annotation Agents: Use LLMs for textual sentiment/event typing (Qwen2.5), MLLMs and vision models for object tags and scene captions (Qwen2.5-VL, RAM).
  • Quality-Control Agent: Majority voting and rule-based gap-filling (Qwen3), ensuring cross-modal consistency.
  • Visualization Agent: Generates event density, sentiment, and event-type trajectory plots.
  • Task-Solve Agents: Generate reference answers using a combination of LLMs/MLLMs, majority voting, and manual verification for the test set.

Preprocessing filters guarantee event-rich sessions and temporal granularity (text tokenization, timestamp normalization). The dataset is split into training (2,005 samples), validation (300), and test (300) sets (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).
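The preprocessing step can be sketched in a few lines of Python. This is illustrative only: the 500–1,500 event bounds come from the text above, but the exact normalization scheme is not specified, so the min–max rescaling and both helper names are assumptions.

```python
def keep_session(events, lo=500, hi=1500):
    """Event-rich filter: keep sessions whose event count falls in the
    reported 500-1500 range (bounds from the paper; helper name is ours)."""
    return lo <= len(events) <= hi

def normalize_timestamps(timestamps):
    """One plausible timestamp normalization: shift a temporally ordered
    session to start at zero and rescale to [0, 1] (assumed scheme)."""
    t0, t1 = timestamps[0], timestamps[-1]
    span = (t1 - t0) or 1.0  # guard against a degenerate single-instant session
    return [(t - t0) / span for t in timestamps]
```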

3. Task Taxonomy and Input/Output Format

DanmakuTPP-QA contains 10 task types, encompassing both predictive and analytical subtasks:

A. Temporal Reasoning
  • Burst-peak count (classification)
  • Next event timestamp prediction (regression)
  • Next burst peak prediction (regression)

B. Sentiment Dynamics
  • Average sentiment polarity (regression)
  • Next event and next burst sentiment prediction (regression)

C. Multimodal Grounding
  • Next event type inference (classification)
  • Top-2 trigger types for next burst (multi-label classification)

D. High-Level Reasoning (Open-Ended)
  • Analysis of global sentiment dynamics
  • Causal attribution for burst peaks

Inputs to each task consist of a windowed, temporally-ordered sequence of events, selected video frames (3–5 per question), synchronized plots, and a natural language prompt. Outputs are standardized: closed-ended tasks yield integer/multi-class labels or real numbers; open-ended tasks require coherent, free-text multi-sentence analyses (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).
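A single closed-ended instance in this format might look as follows. The field names and values are purely illustrative, not the dataset's actual schema; only the overall shape (windowed events, 3–5 frames, plots, prompt, standardized answer) comes from the text above.

```python
# Hypothetical QA instance mirroring the described input/output format.
example = {
    # Windowed, temporally ordered events: (timestamp, comment, event_type)
    "events": [
        (12.4, "amazing play!", "excitement"),
        (12.6, "lol", "humor"),
    ],
    # Selected video frames, 3-5 per question
    "frames": ["frame_0012.jpg", "frame_0045.jpg", "frame_0101.jpg"],
    # Synchronized trajectory plots from the Visualization Agent
    "plots": ["event_density.png", "sentiment_trajectory.png"],
    "question": "Predict the timestamp of the next event.",
    "answer": 12.9,  # real number for regression-style closed-ended tasks
}
```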

4. Evaluation Metrics and Annotation Protocol

Closed-ended tasks are assessed with accuracy (ACC) for classifications and root mean squared error (RMSE) for regression:

$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\left(y_i^{\text{pred}} = y_i^{\text{true}}\right)$$

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^N \left(y_i^{\text{pred}} - y_i^{\text{true}}\right)^2 }$$

Open-ended tasks are evaluated using LLM-based correctness scores in $[0,1]$ (Qwen3-235B-A22B), capturing relevance, coherence, and depth of multimodal reasoning. For baseline TPP modeling, sequence log-likelihood and perplexity are also used (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).
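The two closed-ended metrics above translate directly into a few lines of Python (a straightforward sketch; the helper names are ours):

```python
import math

def accuracy(y_pred, y_true):
    """Fraction of exact matches between predicted and reference labels."""
    assert len(y_pred) == len(y_true)
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def rmse(y_pred, y_true):
    """Root mean squared error for the regression-style tasks."""
    assert len(y_pred) == len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))
```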

The annotation protocol aggregates multi-agent LLM outputs by majority vote and applies rule-based tie-breaking heuristics. All test-set answers undergo manual verification. Inter-annotator agreement is not numerically reported, but conflict rates are under 5%, indicating high consensus (Jiang et al., 23 May 2025).
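The aggregation step can be sketched as follows. This is a minimal illustration assuming per-agent labels arrive as plain strings; the `aggregate` helper and its fallback tie-break are ours, not the pipeline's actual rule set.

```python
from collections import Counter

def aggregate(labels, fallback):
    """Majority vote over per-agent annotations; when the two most
    frequent labels tie, fall back to a rule-based default
    (a stand-in for the paper's tie-breaking heuristics)."""
    counts = Counter(labels)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:  # no label wins outright
        return fallback
    return top
```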

5. Baseline Models and Performance Analysis

DanmakuTPP-QA establishes a rigorous testbed for both classical and modern baseline models:

  • Classical TPP Models: Neural Hawkes Process (NHP), Self-Attentive Hawkes Process (SAHP), Transformer Hawkes Process (THP), Attentive Neural Hawkes Process (AttNHP). These operate on temporal and/or type information, achieving RMSEs in the 0.9–1.0 range for next-event prediction on sequence data.
  • General LLMs/MLLMs: Qwen2.5-Instruct, Qwen3, Llama-3.3-70B, DeepSeek-V3, Gemma3-27B, with and without vision modules (e.g., Qwen2.5-VL variants), and LoRA finetuned models.
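For orientation, the classical Hawkes process intensity that these neural variants (NHP, SAHP, THP, AttNHP) generalize has the standard textbook exponential-kernel form (background knowledge, not taken from the benchmark papers):

```latex
\lambda(t) = \mu + \sum_{t_i < t} \alpha\, e^{-\beta (t - t_i)}
```

Here $\mu$ is the base event rate, and $\alpha$, $\beta$ control the magnitude and decay of the self-excitation each past event $t_i$ contributes; the neural baselines replace this parametric sum with learned history encoders.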

Empirically, larger models (e.g., Qwen3-30B-A3B) demonstrate improved accuracy and RMSE, especially on temporal prediction. However, MLLMs do not consistently outperform unimodal LLMs, suggesting bottlenecks in visual-temporal integration. Notably, LoRA finetuning provides significant performance gains in sentiment tasks but can degrade results in tasks such as next-burst prediction due to overfitting (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).

Table: Performance Summary on Closed-Ended Tasks (Test Set)

| Model | Best ACC (T-1 / T-8) | Best RMSE (T-2 / T-3) | Notes |
| --- | --- | --- | --- |
| Qwen2.5-VL-3B (finetuned) | 27.00 / 43.00 | 1.35 / 220.43 | Sentiment gains, T-3 overfit |
| MM-TPP-3B | 27.33 / 44.00 | 1.49 / 190.45 | State of the art (Li et al., 1 Feb 2026) |

For open-ended tasks (T-9, T-10), MM-TPP models produce analytical text outputs rated more coherent and grounded than standard MLLMs, with focus on TPP-relevant inflection points (Li et al., 1 Feb 2026).

6. Technical Innovations and Modeling Framework

The DanmakuTPP-QA benchmark catalyzed the development of new model architectures and sequence handling methodologies:

  • Multimodal TPP Tokenization: Each event token includes time, type, text, and vision (<|image_pad|>). Visual features are embedded via a vision encoder and aligned with the event stream without patch-based tokenization (Li et al., 1 Feb 2026).
  • Adaptive Sequence Compression: To manage long event histories inherent to Danmaku streams, the adaptive compression procedure replaces sequences of temporally similar events (gap $|T_i - T_{i-1}| < 0.2$ s) with a <|similar_event|> token. This extends the effective context window, preserves burst patterns, and reduces computational cost.
  • Two-Stage Training: Stage 1 uses continued pre-training on both compressed and raw sequence data, optimizing standard autoregressive token loss. Stage 2 applies supervised fine-tuning on subtask-specific tokens or generated text for QA.
  • Ablation Findings: Compression by temporal similarity, as opposed to random drop or pure truncation, substantially preserves predictive signal across sequence lengths. Visual input ablation confirms the necessity of multimodal context for optimal performance (Li et al., 1 Feb 2026).
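The adaptive compression step above can be sketched as a single pass over the ordered event sequence. The 0.2 s threshold and the <|similar_event|> token come from the text; the representation of events as (timestamp, token) pairs and the run-collapsing details are our assumptions.

```python
GAP_THRESHOLD = 0.2  # seconds, per the paper
SIMILAR_TOKEN = "<|similar_event|>"

def compress(events):
    """events: temporally ordered list of (timestamp, token) pairs.
    Collapse each run of events whose inter-event gap is below the
    threshold into the run's first event plus one placeholder token
    (one plausible reading of the described procedure)."""
    if not events:
        return []
    out = [events[0]]
    in_run = False
    for prev, cur in zip(events, events[1:]):
        if cur[0] - prev[0] < GAP_THRESHOLD:
            if not in_run:          # open a new similar-event run
                out.append(SIMILAR_TOKEN)
                in_run = True
            # subsequent near-simultaneous events are absorbed by the token
        else:
            out.append(cur)
            in_run = False
    return out
```

Bursts thus survive as a head event plus a marker, rather than being truncated away, which is what preserves the burst patterns the downstream tasks depend on.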

7. Limitations and Prospects for Future Work

Known limitations of DanmakuTPP-QA include:

  • All sessions use Chinese-language Danmaku, restricting cross-lingual transferability.
  • Annotation quality is bounded by current LLM and MLLM capabilities, with potential residual bias.
  • Open-ended answer evaluation relies on automated LLM scoring, not human judgment.
  • Vision output remains static; the models do not attempt image synthesis.
  • Compression is solely temporal; textual or visual feature-based compression may further enhance model efficiency.

Proposed directions include extending to multilingual data, incorporating human experts to explicitly assess annotator agreement (e.g., Cohen’s κ), integrating retrieval-augmented or span-based grounding tasks, and developing multi-modal temporal models that generalize context fusion and retrieval (Jiang et al., 23 May 2025, Li et al., 1 Feb 2026).

DanmakuTPP-QA provides a challenging environment for benchmarking and advancing multi-modal temporal reasoning, establishing robust baselines and design paradigms for subsequent research in multimodal TPPs, LLMs, and event-centric AI.
