
Moonshot AI's 'Attention Residuals' Rethinks a Core Transformer Building Block

Michael Ouroumis · 2 min read

Moonshot AI's Kimi team has published a research paper that could reshape how large language models are built from the ground up. The technique, called Attention Residuals, reworks one of the most fundamental components in modern AI architecture — the residual connection — and delivers significant efficiency gains in both training and inference.

The paper, posted to arXiv on March 16, has already drawn attention from the research community and was presented by Moonshot AI founder Yang Zhilin at NVIDIA's GTC 2026 conference in San Jose.

Rethinking Residual Connections

Residual connections have been a core building block in virtually every deep learning model since the mid-2010s. They work by adding a layer's input directly to its output, creating shortcut paths that help gradients flow through very deep networks.

The problem, according to the Kimi team, is that this fixed accumulation approach leads to a phenomenon called PreNorm dilution — where useful representations from earlier layers get washed out as the network grows deeper.

Attention Residuals (AttnRes) solves this by replacing the fixed addition with softmax attention over preceding layer outputs. Instead of passively stacking representations, each layer can selectively pull the most relevant information from any earlier layer using learned, input-dependent weights.
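The paper's exact formulation isn't reproduced here, but the core idea — replacing the fixed residual sum with softmax attention over earlier layer outputs — can be sketched roughly as follows. The function name, projection matrices, and single-vector setup below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual(layer_outputs, query_proj, key_proj):
    """Mix all preceding layer outputs with softmax attention instead of
    a fixed sum (a hypothetical sketch of the AttnRes idea).

    layer_outputs: list of (d,) vectors, one per layer so far.
    query_proj, key_proj: (d, d) learned projections (random here).
    """
    H = np.stack(layer_outputs)           # (L, d) stack of layer outputs
    q = H[-1] @ query_proj                # query from the current layer, (d,)
    k = H @ key_proj                      # one key per earlier layer, (L, d)
    scores = k @ q / np.sqrt(H.shape[1])  # scaled similarity per layer, (L,)
    weights = softmax(scores)             # input-dependent mixing weights
    return weights @ H                    # weighted sum replaces fixed addition

rng = np.random.default_rng(0)
d = 8
outs = [rng.normal(size=d) for _ in range(4)]
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
mixed = attention_residual(outs, Wq, Wk)
```

Because the weights depend on the input, a deep layer can pull strongly from one early layer and nearly ignore another, rather than receiving every predecessor with equal, fixed weight as in a standard residual stream.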

Measurable Performance Gains

The results are substantial. According to the paper, AttnRes delivers a 1.25x compute advantage over standard transformer architectures and roughly twice the model capability per unit of training compute.

To address the memory overhead of attending over all preceding layers in very large models, the team also introduced Block AttnRes, which partitions layers into blocks and attends over block-level representations. This variant preserves most of the gains while keeping memory requirements manageable at scale.
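The block-level variant can be sketched in the same style. How the paper actually summarizes each block is not specified here, so mean pooling is assumed purely for illustration, and all names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_attention_residual(layer_outputs, block_size, query):
    """Attend over block-level summaries instead of every individual layer
    (an illustrative reading of Block AttnRes; mean pooling per block is
    an assumption, not the paper's stated block representation)."""
    H = np.stack(layer_outputs)                        # (L, d)
    L, d = H.shape
    n_blocks = (L + block_size - 1) // block_size
    # Summarize each block of layers into one vector, (B, d)
    blocks = np.stack([H[i * block_size:(i + 1) * block_size].mean(axis=0)
                       for i in range(n_blocks)])
    weights = softmax(blocks @ query / np.sqrt(d))     # one weight per block
    return weights @ blocks                            # (d,) mixed output

rng = np.random.default_rng(1)
outs = [rng.normal(size=8) for _ in range(5)]
q = rng.normal(size=8)
summary = block_attention_residual(outs, block_size=2, query=q)
```

The memory saving comes from attending over B block summaries rather than L individual layers, so the attention cost grows with the number of blocks instead of the full network depth.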

Kimi Linear: A Hybrid Architecture

The research also introduced Kimi Linear, a hybrid attention architecture trained on 1.4 trillion tokens that combines the Attention Residuals approach with linear attention mechanisms. The result is 5-6x faster inference compared to standard transformer models of equivalent capability.

Open Source Release

Moonshot AI has open-sourced the implementation on GitHub, making it available for the broader research community to build upon. The paper is available on arXiv (2603.15031).

Implications for the Field

While many recent AI breakthroughs have focused on scaling models larger or training on more data, Attention Residuals takes a different approach — making the architecture itself more efficient. If the technique proves robust across different model sizes and tasks, it could reduce the enormous compute costs that currently limit who can train frontier AI models, potentially shifting the competitive dynamics of the industry.

