Research

Moonshot AI's 'Attention Residuals' Rethinks a Core Transformer Building Block

Michael Ouroumis · 2 min read

Moonshot AI's Kimi team has published a research paper that could reshape how large language models are built from the ground up. The technique, called Attention Residuals, reworks one of the most fundamental components in modern AI architecture — the residual connection — and delivers significant efficiency gains in both training and inference.

The paper, posted to arXiv on March 16, has already drawn attention from the research community and was presented by Moonshot AI founder Yang Zhilin at NVIDIA's GTC 2026 conference in San Jose.

Rethinking Residual Connections

Residual connections have been a core building block in virtually every deep learning model since the mid-2010s. They work by adding a layer's input directly to its output, creating shortcut paths that help gradients flow through very deep networks.
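In code, a standard residual connection is a one-line pattern. The sketch below uses a toy tanh layer purely for illustration; the layer internals are an assumption, but the `x + layer(x)` structure is the general form:

```python
import numpy as np

def layer(x, W):
    """A toy feed-forward layer (stand-in for attention or MLP sublayers)."""
    return np.tanh(x @ W)

def residual_block(x, W):
    # The residual connection: the block's input is added directly to
    # its output, creating the shortcut path that lets gradients flow
    # through very deep stacks.
    return x + layer(x, W)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
out = residual_block(x, W)
```

Because the addition is fixed, every layer's contribution is stacked with equal weight, which is the behavior AttnRes targets.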

The problem, according to the Kimi team, is that this fixed accumulation approach leads to a phenomenon called PreNorm dilution — where useful representations from earlier layers get washed out as the network grows deeper.

Attention Residuals (AttnRes) solves this by replacing the fixed addition with softmax attention over preceding layer outputs. Instead of passively stacking representations, each layer can selectively pull the most relevant information from any earlier layer using learned, input-dependent weights.
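The paper's exact formulation isn't reproduced here, but the core idea can be sketched as follows. This is a minimal illustration, assuming a single learned query projection (`w_query` is a hypothetical parameter name) and scoring earlier layer outputs against it:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_residual(history, w_query):
    """Sketch of the AttnRes idea: replace the fixed sum over earlier
    layer outputs with an input-dependent softmax mixture.

    history : list of preceding layer outputs, each shape (d,)
    w_query : hypothetical learned projection, shape (d, d)
    """
    H = np.stack(history)        # (L, d): outputs of all preceding layers
    q = H[-1] @ w_query          # query derived from the current representation
    weights = softmax(H @ q)     # one learned, input-dependent weight per layer
    return weights @ H           # weighted mix replaces plain addition

rng = np.random.default_rng(0)
history = [rng.standard_normal(4) for _ in range(3)]
w_query = rng.standard_normal((4, 4))
mixed = attention_residual(history, w_query)
```

The softmax weights are recomputed per input, so each layer can emphasize whichever earlier representation is most relevant rather than accumulating all of them uniformly.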

Measurable Performance Gains

The reported gains are substantial. According to the paper, AttnRes delivers a 1.25x compute advantage over standard transformer architectures and roughly doubles the model capability obtained per unit of training compute.

To address the memory overhead of attending over all preceding layers in very large models, the team also introduced Block AttnRes, which partitions layers into blocks and attends over block-level representations. This variant preserves most of the gains while keeping memory requirements manageable at scale.
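The blocking step can be sketched in a few lines. Mean-pooling within each block is an assumption for illustration; the paper's pooling scheme may differ:

```python
import numpy as np

def block_summaries(history, block_size):
    """Sketch of the Block AttnRes idea: group preceding layer outputs
    into blocks and keep one summary per block, so attention runs over
    L / block_size entries instead of all L layers."""
    H = np.stack(history)               # (L, d)
    L, d = H.shape
    n_blocks = -(-L // block_size)      # ceil division; last block may be short
    summaries = [
        H[b * block_size:(b + 1) * block_size].mean(axis=0)  # assumed pooling
        for b in range(n_blocks)
    ]
    return np.stack(summaries)          # (n_blocks, d)

rng = np.random.default_rng(1)
history = [rng.standard_normal(3) for _ in range(5)]
S = block_summaries(history, block_size=2)
```

Attending over the block summaries rather than every individual layer is what keeps the memory footprint manageable as depth grows.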

Kimi Linear: A Hybrid Architecture

The research also introduced Kimi Linear, a hybrid attention architecture trained on 1.4 trillion tokens that combines the Attention Residuals approach with linear attention mechanisms. The result is 5-6x faster inference compared to standard transformer models of equivalent capability.

Open Source Release

Moonshot AI has open-sourced the implementation on GitHub, making it available for the broader research community to build upon. The paper is available on arXiv (2603.15031).

Implications for the Field

While many recent AI breakthroughs have focused on scaling models larger or training on more data, Attention Residuals takes a different approach — making the architecture itself more efficient. If the technique proves robust across different model sizes and tasks, it could reduce the enormous compute costs that currently limit who can train frontier AI models, potentially shifting the competitive dynamics of the industry.

