A new training recipe from ByteDance Seed and HKUST extends Qwen2.5-VL-7B from a 32K-token to a 128K-token context window on a continued-pretraining budget of just 5 billion tokens, or about 2,900 H20 GPU-hours. The resulting model, MMProLong, outscores open vision-language models four to five times its size on long-document understanding. The paper ("Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context," arXiv 2605.13831) lands as teams scramble to push multimodal models past single-document limits without retraining from scratch.
Q&A training beats OCR transcription
The central result is a data-design finding, not a new architecture. The authors compared two ways of continuing pretraining on long PDFs: OCR-style transcription (reproduce every page, or retrieve and transcribe a few pages among distractors) versus long-document visual question answering (extract a single fact, aggregate across pages, or reason numerically across the document). VQA won cleanly. The best VQA setup, extract-multi, reached a 56.90 average — a 6.3-point lift over the Qwen2.5-VL-7B baseline — while the strongest OCR variant, OCR-needle, managed only 52.44 (+1.9). The VQA-trained model also generalized better to unseen document formats and burned fewer tokens per image. As the authors frame it, query-based training teaches the model to prioritize relevance over completeness.
The numbers
On MMLongBench, MMProLong posted a 59.56% average at 64K (vs. 52.24% for the baseline) and 55.84% at 128K (vs. 48.94%), a gain of roughly 7 points at each length. Component scores at 128K include SlideVQA at 77.00% (up from 68.00%) and LongDocURL at 56.33%. The recipe also held up well past its 128K training window with no extra training: 55.09% at 256K (baseline 38.12%) and 52.52% at 512K, where the baseline collapsed to 19.49%. Cross-task transfer was strong too: MM-NIAH jumped from 20.0 to 49.4.
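The gains quoted above are plain differences between reported scores. A minimal Python sketch recomputes them from the figures in this article; the dictionary layout and the `gain` helper are illustrative names, not anything from the paper.

```python
# MMLongBench averages (%) as reported in this article: (MMProLong, baseline).
# The dict layout and helper name are illustrative, not taken from the paper.
scores = {
    "64K": (59.56, 52.24),
    "128K": (55.84, 48.94),
    "256K": (55.09, 38.12),
    "512K": (52.52, 19.49),
}


def gain(context: str) -> float:
    """Point gain of MMProLong over the baseline at one context length."""
    model, baseline = scores[context]
    return round(model - baseline, 2)


for context in scores:
    print(f"{context}: +{gain(context):.2f} points")
```

Read this way, the gap is about 7 points inside the training window and widens to roughly 17 points at 256K and 33 points at 512K, mostly because the baseline degrades faster.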
The training pool was deliberately large and varied: 1,537,504 PDFs of 20–200 pages each (36.6M pages total, 96% English). The authors landed on an 8:2 extraction-to-reasoning mixture (57.70 average) and showed that a balanced length distribution beats data skewed toward 128K. Crucially, training on long-document VQA alone preserved short-context skill: with zero short-context data in the mix, the short-context average dropped only from 66.47 to 65.48.
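One way to picture the 8:2 mixture is as a weighted draw over task types. The sketch below assumes the ratio applies per training sample (the article does not say whether it is counted per sample or per token), and the labels and function name are hypothetical placeholders rather than the authors' code. It also derives the mean document length implied by the corpus statistics.

```python
import random

# Hypothetical task labels; placeholders, not the paper's identifiers.
EXTRACTION = "extraction"
REASONING = "reasoning"


def sample_task_types(n: int, extraction_weight: float = 0.8, seed: int = 0) -> list[str]:
    """Draw n task labels with an 8:2 extraction-to-reasoning ratio in expectation.

    Assumes the ratio is applied per sample, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return rng.choices(
        [EXTRACTION, REASONING],
        weights=[extraction_weight, 1 - extraction_weight],
        k=n,
    )


batch = sample_task_types(10_000)
share = batch.count(EXTRACTION) / len(batch)
print(f"extraction share: {share:.3f}")

# Mean document length implied by the corpus figures quoted in the article.
pages_per_pdf = 36.6e6 / 1_537_504
print(f"mean pages per PDF: {pages_per_pdf:.1f}")
```

The implied mean of about 24 pages per PDF suggests the pool leans toward the short end of the 20–200 page range.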
What changes for builders
For anyone shipping document AI, RAG over scanned filings, or multimodal agents, the takeaway is concrete: you can buy a 4x context extension and beat InternVL3-38B (48.88) and Gemma3-27B (52.63) with a 7B backbone for roughly 2,900 H20 GPU-hours, provided you train on synthetic Q&A rather than transcription. The recipe replicated on Qwen3-VL-8B, suggesting it ports across backbones. It won't catch closed frontier models like Gemini-3.1-Pro on aggregate, but for self-hosted long-document pipelines, the cost-to-capability math just shifted.
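For budgeting purposes, the headline figures reduce to a few lines of arithmetic. This is a rough sketch using only the numbers quoted in this article; it assumes "K" means 1,024 tokens, and the `cost_usd` helper takes a hypothetical rental rate that you would supply yourself, since the article gives no price.

```python
# Figures quoted in this article; variable names are illustrative.
TOKENS = 5e9                 # continued-pretraining budget in tokens
GPU_HOURS = 2_900            # approximate H20 GPU-hours
BASE_CTX = 32 * 1024         # assumes "K" means 1,024 tokens
NEW_CTX = 128 * 1024

extension = NEW_CTX / BASE_CTX
tokens_per_gpu_hour = TOKENS / GPU_HOURS
tokens_per_gpu_second = tokens_per_gpu_hour / 3600


def cost_usd(hourly_rate: float) -> float:
    """Total cost at a hypothetical per-GPU-hour rental rate (not from the paper)."""
    return GPU_HOURS * hourly_rate


print(f"context extension: {extension:.0f}x")
print(f"throughput: {tokens_per_gpu_hour:,.0f} tokens per GPU-hour")
print(f"throughput: {tokens_per_gpu_second:,.0f} tokens per GPU-second")
```

Calling `cost_usd` with your own hourly rate gives a first-order estimate; it ignores data synthesis, evaluation, and failed runs.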



