
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

Published 16 May 2025 in cs.CV and cs.AI (arXiv:2505.11326v1)

Abstract: Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings -- $\textit{perceptual updating}$ and $\textit{contingency awareness}$ -- and propose a new benchmark task, $\textbf{Temporally-Grounded Language Generation (TGLG)}$, to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, $\textbf{TRACE}$, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present $\textbf{Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)}$, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest -- highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data available $\href{https://github.com/yukw777/tglg}{here}$.


Summary

Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

The paper "Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models" by Keunwoo Peter Yu and Joyce Chai explores the challenges and opportunities in developing vision-language models (VLMs) capable of real-time language generation. Unlike traditional VLMs focused on offline tasks, this research emphasizes the necessity for models to operate fluently in dynamic environments where inputs are continuous and where precise timing of responses is crucial.

Key Contributions

  1. Identifying Core Capabilities: The authors outline two critical capabilities required for real-time interactive environments: perceptual updating and contingency awareness. Perceptual updating refers to the model's ability to continuously revise interpretations based on new sensory inputs, while contingency awareness involves adjusting actions based on their consequences.
  2. Introducing TGLG Benchmark: They propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), designed to evaluate these capabilities in models. TGLG challenges models to generate responses that are both semantically meaningful and temporally aligned with streaming visual input, using curated datasets from sports broadcasting and egocentric human interaction domains.
  3. Developing TRACE Metric: The paper introduces the Temporal Responsiveness and Alignment Coherence Evaluation (TRACE), a metric to jointly assess semantic similarity and temporal alignment between generated and ground-truth utterances. This metric provides a comprehensive evaluation of model performance in real-time settings.
  4. Presenting VLM-TSI Model: The researchers propose a new architecture—Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)—which interleaves visual and linguistic tokens along a shared timeline. This approach enables fluid, frame-by-frame language generation without relying on turn-based assumptions, showing significant improvements over existing baselines like VideoLLM-Online.
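The time-synchronized interleaving idea can be illustrated with a small sketch. This is not the paper's implementation; the event representation and the frames-before-text tie-breaking rule are assumptions made for illustration. It shows how frame tokens and utterance tokens might be merged onto a single timeline so that each generated word is conditioned on the latest available visual input:

```python
import heapq

def interleave_streams(frame_events, text_events):
    """Hypothetical sketch of time-synchronized interleaving.

    frame_events: list of (timestamp, frame_token), sorted by time.
    text_events:  list of (timestamp, text_token), sorted by time.
    Returns one token sequence ordered by timestamp, so each utterance
    token sits next to the frames current at that moment.
    """
    # Merge the two already-sorted streams by timestamp. The middle tuple
    # element breaks ties: frames (0) sort before text (1) at the same
    # instant, so language is always conditioned on the latest frame.
    merged = heapq.merge(
        ((t, 0, tok) for t, tok in frame_events),
        ((t, 1, tok) for t, tok in text_events),
    )
    return [tok for _, _, tok in merged]
```

For instance, merging frames at t = 0, 1, 2 with utterance tokens at t = 1.0 and t = 1.5 yields a sequence in which each word appears immediately after the most recent frame, rather than after the whole clip as in a turn-based setup.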

Experimental Results and Implications

Experimental results show that VLM-TSI significantly outperforms VideoLLM-Online under the TRACE metric, demonstrating the benefit of time-synchronized interleaving for real-time interaction. Absolute performance nonetheless remains modest, reflecting the inherent difficulty of the TGLG task and underlining the need for further research on real-time VLMs.
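To make the joint evaluation concrete, here is a minimal sketch of a TRACE-style score. It does not reproduce the paper's actual metric: the token-level Jaccard overlap (a cheap stand-in for a learned semantic-similarity model) and the exponential temporal discount with tolerance `tau` are illustrative assumptions, chosen only to show how semantic and temporal quality can be combined into one number:

```python
import math

def trace_score(predictions, references, tau=2.0):
    """Hypothetical TRACE-style score (sketch, not the paper's metric).

    predictions / references: lists of (timestamp_sec, utterance) pairs.
    tau: assumed temporal tolerance in seconds; larger tau forgives
         bigger timing offsets.
    """
    def semantic_sim(a, b):
        # Stand-in for embedding-based semantic similarity:
        # Jaccard overlap of lowercase tokens.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    if not references:
        return 0.0
    total = 0.0
    for rt, rtext in references:
        # Match each reference to its best-scoring prediction, discounting
        # semantic similarity by how far off the timing is.
        best = 0.0
        for pt, ptext in predictions:
            temporal = math.exp(-abs(pt - rt) / tau)  # 1.0 when perfectly timed
            best = max(best, semantic_sim(ptext, rtext) * temporal)
        total += best
    return total / len(references)
```

Under this sketch, a perfectly worded utterance delivered two seconds late scores the same as a partially correct one delivered on time, which is exactly the kind of trade-off a joint semantic-temporal metric must arbitrate.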

The implications of this research are significant for practical applications in fields like assistive technology, autonomous systems, and interactive media. The ability to seamlessly integrate language and vision in real-time opens possibilities for more adaptive and responsive AI systems. Theoretically, this work challenges existing paradigms in vision-language model design, advocating for architectures that can dynamically integrate inputs and outputs with real-world temporal demands.

Future Directions

Moving forward, the authors suggest refining real-time benchmarks and developing more sophisticated metrics that capture detailed aspects of model interaction. Additionally, exploring on-policy evaluations and adaptive metric learning could yield insights into optimizing TGLG across diverse applications.

In conclusion, this paper contributes to the foundational framework necessary for advancing vision-language models toward real-time interaction capabilities. The proposed TGLG benchmark and VLM-TSI model represent progressive steps in addressing challenges posed by dynamic environments, providing a robust basis for future AI developments.
