Effectiveness of image-text MLLM continual learning methods on Video-LLMs

Determine the effectiveness of mainstream multimodal continual learning methods originally developed for image–text multimodal large language models (Replay, OLoRA, MoELoRA, ModalPrompt, RegLoRA, CL-MoE, HiDe, DISCO, SMoLoRA, and MR-LoRA) when applied to video large language models that require temporal reasoning, such as Video-LLaVA and VideoLLaMA2.
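
Of the listed methods, Replay is the most directly transferable: stored samples from earlier tasks are interleaved into each new task's fine-tuning batches. The sketch below illustrates that loop in PyTorch under stated assumptions; the toy linear model, buffer capacity, and synthetic data are placeholders for a real Video-LLM trained on (video, question, answer) triples, not a description of any of the benchmarked implementations.

```python
# Minimal sketch of experience replay for sequential fine-tuning.
# The toy linear "model" stands in for a Video-LLM; all hyperparameters
# here are illustrative assumptions.
import random
import torch
import torch.nn as nn

class ReservoirBuffer:
    """Fixed-capacity buffer filled by reservoir sampling over the stream."""
    def __init__(self, capacity: int = 512):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def draw(self, k: int):
        return random.sample(self.data, min(k, len(self.data)))

model = nn.Linear(16, 4)                  # stand-in for the Video-LLM
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
buffer = ReservoirBuffer()

def make_task(n: int = 64):
    """Synthetic task: random features with random labels (illustration only)."""
    return [(torch.randn(16), torch.randint(0, 4, ()).item()) for _ in range(n)]

for task_id in range(3):                  # tasks arrive sequentially
    for x, y in make_task():
        batch = [(x, y)] + buffer.draw(3) # mix current sample with replayed ones
        xs = torch.stack([b[0] for b in batch])
        ys = torch.tensor([b[1] for b in batch])
        loss = loss_fn(model(xs), ys)
        opt.zero_grad(); loss.backward(); opt.step()
        buffer.add((x, y))                # retain for future tasks
```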

Background

A large body of continual learning techniques for multimodal LLMs has been developed and validated primarily in image–text settings. Video-LLMs introduce fundamentally different requirements because they must model temporal dynamics and reason over sequences of frames.

CL-VISTA evaluates ten representative continual learning approaches on Video-LLaVA and VideoLLaMA2, but its authors explicitly identify the transferability of these image–text-oriented methods to video settings that demand temporal reasoning as an open question.
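
Benchmarks of this kind typically summarize each method with standard continual-learning metrics. The sketch below computes two common ones, average final accuracy and forgetting, from an accuracy matrix `acc[i][j]` (accuracy on task j after training on task i); this is the generic textbook formulation, assumed here for illustration rather than taken from CL-VISTA's exact protocol.

```python
# Generic continual-learning metrics over an accuracy matrix, where
# acc[i][j] is accuracy on task j measured after finishing training on
# task i. Standard definitions, shown as an assumption about how such
# benchmarks score methods.
def average_accuracy(acc: list[list[float]]) -> float:
    """Mean accuracy over all tasks after the final training stage."""
    final = acc[-1]
    return sum(final) / len(final)

def forgetting(acc: list[list[float]]) -> float:
    """Mean drop from each task's best earlier accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for j in range(T - 1):  # the last task cannot have been forgotten yet
        best = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)

# Example: three tasks trained in sequence.
acc = [
    [0.80, 0.00, 0.00],
    [0.72, 0.78, 0.00],
    [0.65, 0.70, 0.82],
]
print(average_accuracy(acc))  # 0.7233...
print(forgetting(acc))        # ((0.80 - 0.65) + (0.78 - 0.70)) / 2 = 0.115
```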

References

While these methods have shown strong performance on image-text MLLMs, their effectiveness on Video-LLMs, where temporal reasoning introduces fundamentally different demands, remains an open question.

CL-VISTA: Benchmarking Continual Learning in Video Large Language Models (2604.00677 - Guo et al., 1 Apr 2026), Section 3.3, Benchmark Setup, Supported Methods