What's the headline efficiency number?

MMProLong extends Qwen2.5-VL-7B from a 32K to a 128K context window on a 5B-token continued-pretraining budget — roughly 2,900 H20 GPU-hours — and still beats 27B–38B open models on long-document VQA.

Does training on document Q&A really beat OCR transcription?

Yes. The best VQA mixture (extract-multi) hit a 56.90 average versus 52.44 for the strongest OCR-needle setup, and the VQA-trained model generalized better to unseen layouts while processing fewer tokens per page.

What data mixture and length distribution does the paper recommend?

An 8:2 extraction-to-reasoning split (40% single-page extract, 40% multi-page extract, 20% reasoning) scored highest at 57.70. A balanced 'pool-native' length distribution beat data biased toward the 128K target length.

ByteDance's MMProLong Recipe Hits 128K Context on a 7B VLM for 2,900 GPU-Hours — and Q&A Beats OCR

A new training recipe from ByteDance Seed and HKUST takes Qwen2.5-VL-7B from a 32K to a 128K-token context window on a continued-pretraining budget of just 5 billion tokens — about 2,900 H20 GPU-hours — and the resulting model, MMProLong, outscores open vision-language models four to five times its size on long-document understanding. The paper ("Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context," arXiv 2605.13831) lands as teams scramble to push multimodal models past single-document limits without retraining from scratch.

Q&A training beats OCR transcription

The central result is a data-design finding, not a new architecture. The authors compared two ways of continuing pretraining on long PDFs: OCR-style transcription (reproduce every page, or retrieve and transcribe a few pages among distractors) versus long-document visual question answering (extract a single fact, aggregate across pages, or reason numerically across the document). VQA won cleanly. The best VQA setup, extract-multi, reached a 56.90 average — a 6.3-point lift over the Qwen2.5-VL-7B baseline — while the strongest OCR variant, OCR-needle, managed only 52.44 (+1.9). The VQA-trained model also generalized better to unseen document formats and burned fewer tokens per image. As the authors frame it, query-based training teaches the model to prioritize relevance over completeness.

The numbers

On MMLongBench, MMProLong posted a 59.56% average at 64K (vs. 52.24% baseline) and 55.84% at 128K (vs. 48.94%) — roughly a 7-point gain. Component scores at 128K include SlideVQA at 77.00% (up from 68.00%) and LongDocURL at 56.33%. The recipe held up well past its training window with no extra training: 55.09% at 256K (baseline 38.12%) and 52.52% at 512K, where the baseline collapsed to 19.49%. Cross-task transfer was strong too — MM-NIAH jumped to 49.4 from 20.0.
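To make the size of those gaps concrete, here is a minimal sketch that recomputes the point gains from the MMLongBench averages quoted above. The scores are copied from this article; the `REPORTED` table and the `gains` helper are illustrative names, not anything taken from the paper.

```python
# MMLongBench averages (percent) as quoted in this article:
# context length -> (MMProLong score, Qwen2.5-VL-7B baseline score).
REPORTED = {
    "64K": (59.56, 52.24),
    "128K": (55.84, 48.94),
    "256K": (55.09, 38.12),
    "512K": (52.52, 19.49),
}


def gains(reported):
    """Return {context: absolute point gain of MMProLong over the baseline}."""
    return {ctx: round(model - base, 2) for ctx, (model, base) in reported.items()}


if __name__ == "__main__":
    for ctx, gain in gains(REPORTED).items():
        print(f"{ctx}: +{gain:.2f} points")
```

Read this way, the gap is about 7 points inside the training window and widens to roughly 17 points at 256K and 33 points at 512K, which is where the baseline falls apart.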

The training pool was deliberately large and varied: 1,537,504 PDFs spanning 20–200 pages (36.6M pages total, 96% English). The authors land on an 8:2 extraction-to-reasoning mixture (57.70 average) and show that a balanced length distribution beats data skewed toward 128K. Crucially, pure long-document VQA preserves short-context skill: the short-context average slipped by about one point, from 66.47 to 65.48, with zero short-context data in the mix.
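For builders who want to reproduce the 8:2 split in their own data loader, one way to do it is weighted sampling over task types. This is a rough sketch under stated assumptions: the 40/40/20 weights come from the article, while the task labels, the `sample_tasks` helper, and the idea of sampling per example are hypothetical and not a description of the authors' pipeline.

```python
import random
from collections import Counter

# Mixture weights as reported in the article: 8:2 extraction-to-reasoning.
# The task names are illustrative labels, not identifiers from the paper.
MIXTURE = {
    "extract_single": 0.4,
    "extract_multi": 0.4,
    "reasoning": 0.2,
}


def sample_tasks(n, mixture=MIXTURE, seed=0):
    """Draw n task labels with probability proportional to the mixture weights."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = 10_000
    counts = Counter(sample_tasks(total))
    for name in MIXTURE:
        print(f"{name}: {counts[name] / total:.3f}")
```

Sampling per example rather than fixing exact counts keeps the loader simple; over a large run the observed shares converge on the target weights.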

What changes for builders

For anyone shipping document AI, RAG over scanned filings, or multimodal agents, the takeaway is concrete: you can buy a 4x context extension and beat InternVL3-38B (48.88) and Gemma3-27B (52.63) with a 7B backbone for a few thousand GPU-hours — provided you train on synthetic Q&A rather than transcription. The recipe replicated on Qwen3-VL-8B, suggesting it ports across backbones. It won't catch closed frontier models like Gemini-3.1-Pro on aggregate, but for self-hosted long-document pipelines, the cost-to-capability math just shifted.
