
Moonshot AI's 'Attention Residuals' Rethinks a Core Transformer Building Block

Michael Ouroumis · 2 min read

Moonshot AI's Kimi team has published a research paper that could reshape how large language models are built from the ground up. The technique, called Attention Residuals, reworks one of the most fundamental components in modern AI architecture — the residual connection — and delivers significant efficiency gains in both training and inference.

The paper, posted to arXiv on March 16, has already drawn attention from the research community and was presented by Moonshot AI founder Yang Zhilin at NVIDIA's GTC 2026 conference in San Jose.

Rethinking Residual Connections

Residual connections have been a core building block in virtually every deep learning model since the mid-2010s. They work by adding a layer's input directly to its output, creating shortcut paths that help gradients flow through very deep networks.
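The shortcut idea is simple to state in code. Here is a minimal numpy sketch of a standard residual connection, using a hypothetical toy layer (the `layer` function and its weights are illustrative, not from the paper):

```python
import numpy as np

def layer(x, W):
    """A toy layer: linear map followed by ReLU (illustrative only)."""
    return np.maximum(W @ x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)          # the layer's input
W = rng.standard_normal((8, 8)) * 0.1

# Residual connection: the input is added directly to the layer's
# output, creating an identity shortcut that gradients flow through.
y = x + layer(x, W)
```

Because the identity term `x` passes through unchanged, the gradient of `y` with respect to `x` always contains an identity component, which is what keeps very deep stacks trainable.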

The problem, according to the Kimi team, is that this fixed accumulation approach leads to a phenomenon called PreNorm dilution — where useful representations from earlier layers get washed out as the network grows deeper.

Attention Residuals (AttnRes) solves this by replacing the fixed addition with softmax attention over preceding layer outputs. Instead of passively stacking representations, each layer can selectively pull the most relevant information from any earlier layer using learned, input-dependent weights.
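The core idea can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: the query projection `w_query` and the scoring scheme are assumptions made for the example, but the structure matches the description above, with a softmax over the stack of preceding layer outputs producing input-dependent mixing weights:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attn_residual(history, x, w_query):
    """Sketch of an attention residual: instead of a fixed sum of all
    earlier outputs, form input-dependent softmax weights over the
    preceding layers and return their weighted combination."""
    H = np.stack(history)             # (layers_so_far, d)
    query = w_query @ x               # input-dependent query, shape (d,)
    scores = H @ query                # one relevance score per earlier layer
    weights = softmax(scores)         # attention distribution over layers
    return weights @ H                # selective mix of earlier outputs

rng = np.random.default_rng(1)
d = 4
w_query = rng.standard_normal((d, d)) * 0.1
history = [rng.standard_normal(d) for _ in range(3)]  # earlier layer outputs
x = rng.standard_normal(d)                            # current layer input
mixed = attn_residual(history, x, w_query)
```

The key contrast with a plain residual stream: the weights depend on the current input `x`, so each layer can emphasize whichever earlier representation is most relevant rather than receiving a fixed, uniformly accumulated sum.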

Measurable Performance Gains

The results are substantial. According to the paper, AttnRes delivers a 1.25x compute advantage over standard transformer architectures and yields roughly 2x more model capability from the same training compute.

To address the memory overhead of attending over all preceding layers in very large models, the team also introduced Block AttnRes, which partitions layers into blocks and attends over block-level representations. This variant preserves most of the gains while keeping memory requirements manageable at scale.
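The block variant can be illustrated with a small sketch. The summary operation here (a per-block mean) is an assumption for illustration; the point is that attention then runs over one representation per block rather than one per layer, shrinking the memory footprint:

```python
import numpy as np

def block_summaries(history, block_size):
    """Group earlier layer outputs into fixed-size blocks and keep one
    summary vector per block (a mean here, purely illustrative), so
    attention scales with the number of blocks, not the number of layers."""
    H = np.stack(history)                              # (L, d)
    n_blocks = H.shape[0] // block_size
    blocks = H[: n_blocks * block_size].reshape(n_blocks, block_size, -1)
    return blocks.mean(axis=1)                         # (n_blocks, d)

rng = np.random.default_rng(2)
history = [rng.standard_normal(4) for _ in range(8)]   # 8 layer outputs
summaries = block_summaries(history, block_size=4)     # 8 layers -> 2 blocks
```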

Kimi Linear: A Hybrid Architecture

The research also introduced Kimi Linear, a hybrid attention architecture trained on 1.4 trillion tokens that combines the Attention Residuals approach with linear attention mechanisms. The result is 5-6x faster inference compared to standard transformer models of equivalent capability.

Open Source Release

Moonshot AI has open-sourced the implementation on GitHub, making it available for the broader research community to build upon. The paper is available on arXiv (2603.15031).

Implications for the Field

While many recent AI breakthroughs have focused on scaling models larger or training on more data, Attention Residuals takes a different approach — making the architecture itself more efficient. If the technique proves robust across different model sizes and tasks, it could reduce the enormous compute costs that currently limit who can train frontier AI models, potentially shifting the competitive dynamics of the industry.

