
CL-VISTA: Benchmarking Continual Learning in Video Large Language Models

Published 1 Apr 2026 in cs.CV | (2604.00677v1)

Abstract: Video LLMs (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to determine whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.

Summary

  • The paper introduces a benchmark that enforces significant distribution shifts to realistically assess catastrophic forgetting in Video-LLMs.
  • It implements a comprehensive 6-protocol evaluation framework, measuring performance, computational efficiency, and storage overhead across diverse tasks.
  • Empirical results reveal a stability-plasticity trade-off where specialized anti-forgetting methods often compromise generalization on held-out benchmarks.

CL-VISTA: A Rigorous Benchmark for Continual Learning in Video-LLMs

Motivation and Problem Formulation

Current continual-learning benchmarks for video understanding are fundamentally misaligned with the capabilities of large-scale Video-LLMs, due to two key limitations: (1) capability bias—benchmarking on weakly or non-pretrained models; and (2) distribution bias—partitioning a single dataset by superficial question types, which induces high task redundancy and fails to produce realistic forgetting. Quantitative assessment shows that modern Video-LLMs such as Video-LLaVA exhibit negligible catastrophic forgetting on legacy benchmarks, indicating that those settings do not meaningfully test continual learning. In contrast, CL-VISTA enforces significant distributional shifts between tasks, demonstrated by pronounced performance drops and negative backward transfer (BWT) under sequential fine-tuning.

Figure 1: Unlike existing benchmarks, CL-VISTA induces substantial forgetting under sequential fine-tuning, exposing a significant performance gap.

A t-SNE–based analysis of joint video-question embeddings further reveals that CL-VISTA achieves much clearer task separation and greater inter-task distances than prior benchmarks, confirming its efficacy in mitigating distributional and capability bias.

Figure 2: Embeddings on CL-VISTA display high inter-task distance, in stark contrast to previous benchmarks with severe entanglement.
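The separation that the t-SNE plot visualizes can also be quantified directly as the mean distance between per-task embedding centroids. The sketch below illustrates the idea on synthetic embeddings; it is not the paper's actual analysis, and the "well-separated" vs. "entangled" data are toy stand-ins for CL-VISTA and legacy benchmarks.

```python
import numpy as np

def inter_task_centroid_distances(embeddings, labels):
    """Mean pairwise Euclidean distance between per-task embedding centroids."""
    tasks = np.unique(labels)
    centroids = np.stack([embeddings[labels == t].mean(axis=0) for t in tasks])
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    iu = np.triu_indices(len(tasks), k=1)  # upper triangle: each task pair once
    return dists[iu].mean()

rng = np.random.default_rng(0)
# well-separated tasks (CL-VISTA-like) vs. entangled tasks (legacy-benchmark-like)
sep = np.concatenate([rng.normal(loc=5.0 * t, scale=1.0, size=(100, 16)) for t in range(3)])
ent = np.concatenate([rng.normal(loc=0.1 * t, scale=1.0, size=(100, 16)) for t in range(3)])
labels = np.repeat(np.arange(3), 100)
print(inter_task_centroid_distances(sep, labels) > inter_task_centroid_distances(ent, labels))  # True
```

A larger mean centroid distance indicates tasks occupy distinct regions of embedding space, which is what makes sequential fine-tuning across them a non-trivial continual-learning problem.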

In training-trajectory analysis, sequential task learning on CL-VISTA produces sharp loss spikes at task boundaries, confirming considerable distribution shifts, whereas prior datasets show homogeneous optimization curves, signifying trivial task transitions.

Figure 3: Under CL-VISTA, loss spikes at task boundaries reflect non-trivial distribution shifts.
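The boundary spikes in Figure 3 amount to a simple detectable pattern: the loss jumps sharply when a new task's distribution arrives. A minimal sketch of such a detector on a synthetic loss curve (the threshold and curve shape are illustrative assumptions, not values from the paper):

```python
import numpy as np

def loss_spikes(losses, threshold=1.0):
    """Return step indices where the training loss jumps by more than `threshold`
    over the previous step -- a proxy for distribution shift at task boundaries."""
    jumps = np.diff(losses)
    return np.flatnonzero(jumps > threshold) + 1

# synthetic curve: loss decays within each task, then spikes when a new task begins
task = np.linspace(2.0, 0.3, 50)
losses = np.concatenate([task, task, task])
print(loss_spikes(losses))  # boundaries at steps 50 and 100
```

On a low-redundancy benchmark the detector fires at every task boundary; on a benchmark carved from a single dataset, the curve stays smooth and no spikes appear.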

CL-VISTA Benchmark Design and Task Suite

CL-VISTA comprises eight tasks spanning three foundational dimensions:

  • Perception: Counting, Space
  • Understanding: Traffic, Movie, GUI, Science, Sports
  • Reasoning: STAR-based temporal and causal reasoning

The benchmark combines open datasets with a dedicated video QA data generation pipeline that applies multi-discriminator filtering for rigorous, domain-specific QA synthesis.

Figure 4: The CL-VISTA benchmark—modular, extensible, and supporting systemic evaluation protocols.

Figure 5: Dual-discriminator video-based QA data generation and refinement pipeline.
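The filtering stage of such a pipeline can be sketched as a gate that keeps a candidate QA pair only when every discriminator accepts it. The discriminators below (`non_trivial`, `answered`) are hypothetical toy checks standing in for the paper's model-based judges; the actual criteria may differ.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str

def filter_qa(candidates, discriminators):
    """Keep a candidate QA pair only if every discriminator accepts it.
    Discriminators are quality checks (e.g. answerability, visual grounding)."""
    return [qa for qa in candidates if all(d(qa) for d in discriminators)]

# toy discriminators: reject trivially short questions and empty answers
non_trivial = lambda qa: len(qa.question.split()) > 3
answered = lambda qa: bool(qa.answer.strip())

pool = [QAPair("What color is the car?", "red"),
        QAPair("Why?", "unclear"),
        QAPair("How many players are on screen?", "")]
print(len(filter_qa(pool, [non_trivial, answered])))  # 1
```

Requiring unanimous agreement across discriminators trades recall for precision, which is the usual choice when synthesizing benchmark data: a smaller but cleaner QA set.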

Visualization of individual tasks highlights the heterogeneity and complexity of the curated benchmark, with significant diversity in scenario, domain, and instruction type.

Figure 6: Counting and spatial tasks in CL-VISTA.

Figure 7: Traffic and movie domain tasks.

Figure 8: GUI and science specialized QA tasks.

Figure 9: Sports and temporal reasoning tasks.

CL-VISTA is designed for the latest Video-LLMs, especially Video-LLaVA and VideoLLaMA2, both of which possess significant pre-training on large-scale video data.

Evaluation Framework: Metrics and Protocols

CL-VISTA implements a comprehensive 6-protocol evaluation framework across three axes:

  • Performance: Standard CL metrics (MFT, MFN, MAA, BWT) and general video understanding via evaluation on five diverse held-out benchmarks (MMVU, MVBench, NExTQA, LongVideoBench, MMBench-Video).
  • Computational Efficiency: Measures both per-task training and aggregate inference latency.
  • Storage Overhead: Quantifies parameter and memory requirements during training and inference.
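The standard CL metrics above can be computed from an accuracy matrix A, where A[t, i] is accuracy on task i after training through task t. A minimal sketch follows; the definitions of MFT (mean fine-tune accuracy), MFN (mean final accuracy), MAA (mean average accuracy), and BWT (backward transfer) here follow common continual-learning conventions, and the paper's exact formulas may differ.

```python
import numpy as np

def cl_metrics(A):
    """A: (T, T) matrix; A[t, i] = accuracy on task i after learning task t (i <= t)."""
    T = A.shape[0]
    final = A[-1]  # accuracies on all tasks after the final task
    return {
        "MFT": float(np.mean(np.diag(A))),                              # each task right after learning it
        "MFN": float(final.mean()),                                     # all tasks at the end of the sequence
        "MAA": float(np.mean([A[t, : t + 1].mean() for t in range(T)])),  # averaged over the whole run
        "BWT": float(np.mean([final[i] - A[i, i] for i in range(T - 1)])),  # negative = forgetting
    }

# toy 3-task run: accuracy on earlier tasks drops after later training
A = np.array([
    [80.0,  0.0,  0.0],
    [60.0, 75.0,  0.0],
    [50.0, 65.0, 70.0],
])
print(cl_metrics(A))  # negative BWT signals forgetting
```

In this toy run, BWT is (50 - 80 + 65 - 75) / 2 = -20, i.e., each earlier task loses 20 points on average by the end of training, which is the kind of drop CL-VISTA is designed to elicit.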

A robust LLM-as-Judge protocol (Qwen3-30B-A3B-Instruct) suppresses the spurious forgetting that string-matching evaluation introduces.
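The failure mode the judge protocol guards against is easy to illustrate: a free-form but correct answer fails exact string matching, so the model appears to have "forgotten" a task it still solves. The toy functions below show the contrast; `lenient_match` is a crude containment check standing in for a real LLM judge, not the paper's implementation.

```python
def exact_match(pred, gold):
    """Strict string equality -- penalizes correct paraphrases."""
    return pred.strip().lower() == gold.strip().lower()

def lenient_match(pred, gold):
    """Crude containment check; a real LLM judge scores semantic equivalence."""
    p, g = pred.strip().lower(), gold.strip().lower()
    return g in p or p in g

pred, gold = "There are three people in the video.", "three"
print(exact_match(pred, gold), lenient_match(pred, gold))  # False True
```

Aggregated over a benchmark, exact matching systematically understates accuracy on free-form answers, inflating apparent forgetting; semantic judging removes that artifact.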

Empirical Results and Quantitative Analysis

Standard CL evaluation on Video-LLaVA reveals that most existing multimodal CL methods, including those using regularization and architecture expansion (e.g., RegLoRA, HiDe, MoELoRA, CL-MoE), are unable to control forgetting under the challenge imposed by CL-VISTA. Only task-specific adaptation and replay-based methods, such as DISCO and MR-LoRA, attain minimal negative BWT.

However, these task-specialized approaches display a trade-off between stability and plasticity: although memory of seen tasks is preserved, generalization to held-out benchmarks is often worse than under simple LoRA-FT sequential fine-tuning. For instance, MR-LoRA and DISCO, despite strong anti-forgetting, cause relative drops of 4.94% and 5.98%, respectively, on general video understanding (most notably on NExTQA), highlighting detrimental distributional overspecialization.

Training and inference efficiency also vary widely: regularization-based and parameter-expansion methods (e.g., SMoLoRA, MoELoRA, RegLoRA) entail substantial time and memory costs, with RegLoRA imposing extreme storage overhead due to parameter protection.

Figure 10: Training and inference computational efficiency of CL algorithms.

Figure 11: Parameter size overhead for major CL-Video-LLM strategies during training and inference.

This tension leaves none of the tested approaches universally optimal: state-of-the-art forgetting mitigation is achieved only at the cost of generalization and increased resource requirements.

Implications and Future Directions

The rigorous distributional properties and multidimensional evaluation of CL-VISTA reveal that:

  • Existing CL algorithms for Video-LLMs are highly susceptible to trade-offs between generality and forgetting, especially when confronted with realistic distribution shifts.
  • Parameter-expansion and architectural modularity control forgetting at the expense of significant computation and overfitting to specific data regimes.
  • Progressive, curriculum-aligned learning sequences can enhance fundamental generalization but are vulnerable to forgetting.
  • Effective continual learning for Video-LLMs requires new advances in scalable, adaptive, and generalizable representation consolidation—beyond static mixture-of-experts or regularization paradigms—as well as more efficient routing, lifelong memory integration, or meta-continual learning protocols.

By establishing a rigorous, standardized, and scalable testbed, CL-VISTA provides an essential foundation for the evaluation and development of practical, generalizable continual learning for Video-LLMs and other future multimodal foundation models.

Conclusion

CL-VISTA sets a new standard for continual learning in Video-LLMs, exposing the inadequacy of current methodologies and evaluation protocols for modern models. Its challenging task heterogeneity, distributional robustness, and multidimensional evaluation not only provoke catastrophic forgetting but also reveal limitations in generalization and practical deployment. The open-source release of CL-VISTA, along with a unified codebase and leaderboard, constitutes a necessary infrastructure for advancing research on robust, scalable, and efficient continual video understanding systems.
