Reproducing long-document VQA capabilities of recent open-weight VLMs

Determine reproducible training recipes and data strategies for long-context vision-language models that achieve state-of-the-art long-document visual question answering performance comparable to that of Qwen3 VL and GLM 4.5/6V, whose training procedures are currently underspecified.

Background

Recent open-weight models, including Qwen3 VL and GLM 4.5/6V, have surpassed closed models such as GPT-4o on long-document visual question answering benchmarks such as MMLongBenchDoc. However, the public descriptions of their training setups and data pipelines are incomplete, which makes it difficult for practitioners to replicate their performance.

The paper aims to close this gap by systematically studying continued pretraining, supervised finetuning, and preference optimization for long-context visual document understanding. It also explicitly notes that the field as a whole remains uncertain about how to reproduce these capabilities from the limited information available.

References

However, their training recipes and data strategies are underspecified and it remains unclear how to reproduce these capabilities.

How to Train Your Long-Context Visual Document Model (2602.15257 - Veselka, 16 Feb 2026) in Introduction