Effectiveness of image-text MLLM continual learning methods on Video-LLMs
Determine how effective mainstream multimodal continual learning methods originally developed for image-text multimodal large language models (MLLMs), namely Replay, OLoRA, MoELoRA, ModalPrompt, RegLoRA, CL-MoE, HiDe, DISCO, SMoLoRA, and MR-LoRA, are when applied to video large language models (Video-LLMs) that require temporal reasoning, such as Video-LLaVA and VideoLLaMA2.
References
While these methods have shown strong performance on image-text MLLMs, their effectiveness on Video-LLMs, where temporal reasoning introduces fundamentally different demands, remains an open question.
— CL-VISTA: Benchmarking Continual Learning in Video Large Language Models
(2604.00677 - Guo et al., 1 Apr 2026) in Section 3.3, Benchmark Setup — Supported Methods