
MIT Researchers Achieve 1M Token Context With Constant Memory Usage

Michael Ouroumis · 2 min read

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have published a paper introducing a novel attention mechanism that maintains constant memory usage regardless of input length. The technique allows transformer models to process contexts of over one million tokens on a single GPU.

The Breakthrough

Traditional transformer attention scales quadratically with sequence length — doubling the context window quadruples memory usage. This fundamental limitation has been the primary barrier to longer context windows in production models.
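To make the scaling concrete, here is a back-of-the-envelope calculation (our illustration, not figures from the paper) of the memory needed just to hold one dense attention score matrix in fp16:

```python
def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes for one seq_len x seq_len attention score matrix (fp16)."""
    return seq_len * seq_len * bytes_per_score

# Doubling the sequence length quadruples the matrix:
for n in (8_192, 16_384, 32_768):
    print(f"{n:>6} tokens -> {attention_matrix_bytes(n) / 2**20:.0f} MiB per head")
```

At 32K tokens a single head's score matrix already needs 2 GiB, before accounting for multiple heads and layers.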

The MIT team's approach, called "Streaming Sparse Attention" (SSA), replaces the dense attention matrix with a learned sparse representation that identifies which tokens are most relevant to each other. Instead of computing attention across all token pairs, SSA maintains a fixed-size "attention budget" that dynamically allocates computation where it matters most.
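As a rough sketch of the budget idea (our toy code, not the paper's learned mechanism): score every cached key cheaply, then run full attention over only the top-`budget` of them, so attention memory depends on the budget rather than the context length.

```python
import numpy as np

def budgeted_attention(q, k, v, budget: int):
    """Toy fixed attention budget: cheap scores over all keys,
    full softmax attention over only the top-`budget` keys."""
    scores = q @ k.T                        # relevance of each key to the query
    keep = np.argsort(scores)[-budget:]     # indices of the top-budget keys
    s = scores[keep] / np.sqrt(k.shape[1])  # scaled scores for kept keys
    w = np.exp(s - s.max())
    w /= w.sum()                            # softmax over the kept keys only
    return w @ v[keep]

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=(1000, 8))
v = rng.normal(size=(1000, 8))
out = budgeted_attention(q, k, v, budget=64)  # memory is O(budget), not O(n)
```

The real system learns where to spend the budget; the top-k selection here only illustrates the shape of the computation.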

How It Works

SSA operates through three key mechanisms:

Relevance Scoring

A lightweight network scores each token's relevance to the current query in a single forward pass. Only tokens above a learned threshold participate in attention computation.
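A minimal sketch of such a gate, assuming a simple linear scorer (the names and shapes are ours, not from the paper):

```python
import numpy as np

def select_relevant(query, tokens, w_scorer, threshold):
    """Toy relevance gate: a small linear scorer projects each cached
    token, dots it with the query in one cheap pass, and only tokens
    scoring above the threshold proceed to full attention."""
    scores = (tokens @ w_scorer) @ query
    return np.nonzero(scores > threshold)[0]

rng = np.random.default_rng(1)
tokens = rng.normal(size=(500, 8))        # 500 cached token states
w_scorer = rng.normal(size=(8, 8)) * 0.1  # the lightweight scorer
query = rng.normal(size=8)
kept = select_relevant(query, tokens, w_scorer, threshold=0.5)
```

In the paper both the scorer and the threshold are learned; here they are fixed purely for illustration.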

Memory Consolidation

As the context grows, older tokens are progressively consolidated into summary representations. These summaries preserve the semantic content while using a fraction of the memory. The consolidation is learned end-to-end, so the model decides what information to preserve and what to compress.
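The memory saving can be illustrated with a crude stand-in for the learned consolidation (mean-pooling here is our substitute, not the paper's method):

```python
import numpy as np

def consolidate(older, block: int = 4):
    """Toy consolidation: mean-pool each run of `block` old token
    states into one summary vector, shrinking memory by ~1/block."""
    n = (len(older) // block) * block  # drop the ragged tail
    return older[:n].reshape(-1, block, older.shape[1]).mean(axis=1)

old_states = np.ones((128, 16))               # 128 old token vectors
summaries = consolidate(old_states, block=4)  # -> 32 summary vectors
```

Because the real consolidation is trained end-to-end, the model can compress aggressively where content is redundant and conservatively where it is not, which fixed pooling cannot do.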

Anchor Points

The model maintains a set of "anchor" tokens that are never consolidated — typically the beginning of the input, recent tokens, and tokens that have been frequently attended to. This ensures that critical context is always available at full resolution.
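A sketch of that selection rule, with all parameter names and counts chosen by us for illustration:

```python
def anchor_indices(n_tokens, attend_counts, n_lead=4, n_recent=16, n_hot=8):
    """Toy anchor selection: keep the first n_lead tokens, the
    n_recent most recent, and the n_hot most frequently attended;
    everything else is a candidate for consolidation."""
    hot = sorted(range(n_tokens), key=attend_counts.__getitem__,
                 reverse=True)[:n_hot]
    keep = set(range(min(n_lead, n_tokens)))            # start of input
    keep.update(range(max(0, n_tokens - n_recent), n_tokens))  # recent
    keep.update(hot)                                    # frequently attended
    return sorted(keep)

counts = [0] * 100
counts[40] = 99                     # token 40 was attended often
anchors = anchor_indices(100, counts)
```

Token 40 survives at full resolution even though it is neither at the start nor recent, which is the property the anchor set exists to guarantee.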

Benchmarks

The researchers evaluated SSA on several long-context tasks; detailed results are reported in the paper.

Limitations

The paper is transparent about current limitations. SSA introduces a small accuracy loss on tasks requiring exact attention to specific positions in mid-range contexts (8K–32K tokens). The relevance scorer also adds latency to first-token generation, though subsequent tokens are generated faster than with standard attention.

Implications

If SSA proves robust in production settings, it could fundamentally change how AI applications are built. Combined with Stanford's Sparse Cascading Attention, which cuts memory by 60%, these approaches suggest we're entering a new era of dramatically more efficient transformers. Use cases that are currently impractical — processing entire codebases, analyzing book-length documents, maintaining conversation history over weeks — would become feasible on standard hardware.

Several major AI labs have already reached out to the research team to explore integration into their model architectures.
