Reproducing long-document VQA capabilities of recent open-weight VLMs
Determine reproducible training recipes and data strategies that let long-context vision-language models reach state-of-the-art long-document visual question answering (VQA) performance comparable to Qwen3 VL and GLM 4.5/6V, whose training procedures are currently underspecified.
References
However, their training recipes and data strategies are underspecified and it remains unclear how to reproduce these capabilities.
— How to Train Your Long-Context Visual Document Model
(2602.15257 - Veselka, 16 Feb 2026) in Introduction