ByteDance's MMProLong Recipe Hits 128K Context on a 7B VLM for 2,900 GPU-Hours — and Q&A Beats OCR

Michael Ouroumis · 2 min read

A new training recipe from ByteDance Seed and HKUST takes Qwen2.5-VL-7B from a 32K to a 128K-token context window on a continued-pretraining budget of just 5 billion tokens — about 2,900 H20 GPU-hours — and the resulting model, MMProLong, outscores open vision-language models four to five times its size on long-document understanding. The paper ("Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context," arXiv 2605.13831) lands as teams scramble to push multimodal models past single-document limits without retraining from scratch.
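To put the budget in perspective, the headline figures can be turned into rough throughput numbers. This is a back-of-the-envelope sketch using only the values quoted above; it assumes "K" means 1,024 tokens and that the 2,900 GPU-hours cover the whole 5-billion-token run, neither of which the article confirms.

```python
# Figures quoted in the article; everything derived below is plain arithmetic.
TRAIN_TOKENS = 5_000_000_000   # continued-pretraining budget
GPU_HOURS = 2_900              # reported H20 GPU-hours (approximate)
BASE_CONTEXT = 32 * 1024       # 32K tokens (assumes K = 1,024)
TARGET_CONTEXT = 128 * 1024    # 128K tokens (assumes K = 1,024)


def tokens_per_gpu_hour(tokens: int, gpu_hours: float) -> float:
    """Average training tokens processed per GPU-hour."""
    return tokens / gpu_hours


def tokens_per_gpu_second(tokens: int, gpu_hours: float) -> float:
    """Average training tokens processed per GPU-second."""
    return tokens / (gpu_hours * 3600)


# How many times longer the new window is than the old one.
extension_factor = TARGET_CONTEXT // BASE_CONTEXT

# How many full-length 128K sequences would fit in the token budget.
max_len_sequences = TRAIN_TOKENS // TARGET_CONTEXT

print(f"context extension: {extension_factor}x")
print(f"tokens per GPU-hour: {tokens_per_gpu_hour(TRAIN_TOKENS, GPU_HOURS):,.0f}")
print(f"tokens per GPU-second: {tokens_per_gpu_second(TRAIN_TOKENS, GPU_HOURS):,.0f}")
print(f"full 128K sequences in budget: {max_len_sequences:,}")
```

Under those assumptions the run averages roughly 1.7 million tokens per GPU-hour, and the whole budget amounts to about 38,000 sequences at the full 128K length.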

Q&A training beats OCR transcription

The central result is a data-design finding, not a new architecture. The authors compared two ways of continuing pretraining on long PDFs: OCR-style transcription (reproduce every page, or retrieve and transcribe a few pages among distractors) versus long-document visual question answering (extract a single fact, aggregate across pages, or reason numerically across the document). VQA won cleanly. The best VQA setup, extract-multi, reached a 56.90 average — a 6.3-point lift over the Qwen2.5-VL-7B baseline — while the strongest OCR variant, OCR-needle, managed only 52.44 (+1.9). The VQA-trained model also generalized better to unseen document formats and burned fewer tokens per image. As the authors frame it, query-based training teaches the model to prioritize relevance over completeness.
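The difference between the two data styles is easiest to see side by side. The sketch below is illustrative only: the `LongDocSample` structure, the prompts, the file names, and the toy document are all hypothetical and are not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LongDocSample:
    """One training example over a multi-page document (hypothetical schema)."""
    page_images: List[str]  # paths to rendered PDF pages
    prompt: str             # instruction or question shown to the model
    target: str             # text the model is trained to produce


def make_ocr_sample(pages: List[str], page_texts: List[str]) -> LongDocSample:
    """OCR-style: the target reproduces the text of every page."""
    return LongDocSample(
        page_images=pages,
        prompt="Transcribe every page of this document.",
        target="\n\n".join(page_texts),
    )


def make_vqa_sample(pages: List[str], question: str, answer: str) -> LongDocSample:
    """VQA-style: the target is only the answer to a specific query."""
    return LongDocSample(page_images=pages, prompt=question, target=answer)


# Toy three-page document (invented content, for illustration only).
pages = [f"page_{i:03d}.png" for i in range(1, 4)]
texts = [
    "Annual report. Overview of operations and strategy.",
    "Financial summary. Total revenue was 4.2 million.",
    "Outlook. Management expects stable demand next year.",
]

ocr = make_ocr_sample(pages, texts)
vqa = make_vqa_sample(pages, "What was total revenue?", "4.2 million")

print(len(ocr.target), "target characters for OCR")
print(len(vqa.target), "target characters for VQA")
```

Both samples feed the model the same pages, but the OCR target grows with the document while the VQA target stays short, which is one way to read the authors' point about relevance over completeness.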

The numbers

On MMLongBench, MMProLong posted a 59.56% average at 64K (vs. 52.24% for the baseline) and 55.84% at 128K (vs. 48.94%), gains of 7.3 and 6.9 points. Component scores at 128K include SlideVQA at 77.00% (up from 68.00%) and LongDocURL at 56.33%. The recipe held up well past its training window with no extra training: 55.09% at 256K (baseline 38.12%) and 52.52% at 512K, where the baseline collapsed to 19.49%. Cross-task transfer was strong too: MM-NIAH jumped from 20.0 to 49.4.
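The gap between the two models widens sharply as context grows, which a few lines of arithmetic make explicit. The sketch below uses only the MMLongBench averages quoted above; the dictionary layout and function names are just a convenient way to hold them.

```python
# MMLongBench averages from the article: (MMProLong, Qwen2.5-VL-7B baseline).
SCORES = {
    "64K": (59.56, 52.24),
    "128K": (55.84, 48.94),
    "256K": (55.09, 38.12),
    "512K": (52.52, 19.49),
}


def gains(scores: dict) -> dict:
    """Point gain of MMProLong over the baseline at each context length."""
    return {ctx: round(model - base, 2) for ctx, (model, base) in scores.items()}


def drop(scores: dict, start: str, end: str) -> tuple:
    """Points lost by (MMProLong, baseline) going from `start` to `end`."""
    model_start, base_start = scores[start]
    model_end, base_end = scores[end]
    return round(model_start - model_end, 2), round(base_start - base_end, 2)


for ctx, gain in gains(SCORES).items():
    print(f"{ctx:>4}: +{gain:.2f} points over baseline")

model_drop, base_drop = drop(SCORES, "128K", "512K")
print(f"128K -> 512K: MMProLong -{model_drop:.2f}, baseline -{base_drop:.2f}")
```

The margin grows from about 7 points inside the training window to about 33 points at 512K. Between 128K and 512K, MMProLong gives up roughly 3.3 points while the baseline loses roughly 29.5.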

The training pool was deliberately large and varied: 1,537,504 PDFs spanning 20–200 pages (36.6M pages total, 96% English). The authors settle on an 8:2 extraction-to-reasoning mixture (57.70 average) and show that a balanced length distribution beats data skewed toward 128K. Crucially, pure long-document VQA preserves short-context skill: the short-context average dropped only from 66.47 to 65.48 with zero short-context data.
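One simple way to realize that mixture in a data loader is weighted sampling over task types combined with uniform sampling over length buckets. This is a minimal sketch, not the authors' pipeline: the 8:2 ratio comes from the article, while the bucket names, the `sample_plan` function, and the idea of drawing task and length independently are assumptions made for illustration.

```python
import random
from collections import Counter

# 8:2 extraction-to-reasoning ratio, as reported in the article.
TASK_WEIGHTS = {"extraction": 8, "reasoning": 2}

# Hypothetical length buckets; "balanced" here means each is equally likely.
LENGTH_BUCKETS = ["short", "medium", "long", "max"]


def sample_plan(n: int, seed: int = 0) -> list:
    """Draw n (task, length_bucket) pairs for a training run."""
    rng = random.Random(seed)  # seeded so the plan is reproducible
    tasks = rng.choices(
        list(TASK_WEIGHTS), weights=list(TASK_WEIGHTS.values()), k=n
    )
    lengths = [rng.choice(LENGTH_BUCKETS) for _ in range(n)]
    return list(zip(tasks, lengths))


plan = sample_plan(10_000)
task_counts = Counter(task for task, _ in plan)
length_counts = Counter(length for _, length in plan)

print({t: c / len(plan) for t, c in task_counts.items()})
print({b: c / len(plan) for b, c in length_counts.items()})
```

A length distribution skewed toward 128K would correspond to giving the longest bucket a larger weight, which is the variant the authors report as underperforming the balanced one.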

What changes for builders

For anyone shipping document AI, RAG over scanned filings, or multimodal agents, the takeaway is concrete: you can buy a 4x context extension and beat InternVL3-38B (48.88) and Gemma3-27B (52.63) with a 7B backbone for a few thousand GPU-hours — provided you train on synthetic Q&A rather than transcription. The recipe replicated on Qwen3-VL-8B, suggesting it ports across backbones. It won't catch closed frontier models like Gemini-3.1-Pro on aggregate, but for self-hosted long-document pipelines, the cost-to-capability math just shifted.
