Research

Moonshot AI's 'Attention Residuals' Rethinks a Core Transformer Building Block

Michael Ouroumis · 2 min read

Moonshot AI's Kimi team has published a research paper that could reshape how large language models are built from the ground up. The technique, called Attention Residuals, reworks one of the most fundamental components in modern AI architecture — the residual connection — and delivers significant efficiency gains in both training and inference.

The paper, posted to arXiv on March 16, has already drawn attention from the research community and was presented by Moonshot AI founder Yang Zhilin at NVIDIA's GTC 2026 conference in San Jose.

Rethinking Residual Connections

Residual connections have been a core building block in virtually every deep learning model since the mid-2010s. They work by adding a layer's input directly to its output, creating shortcut paths that help gradients flow through very deep networks.
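In code, a standard residual connection is a one-line pattern. The sketch below uses a toy tanh layer purely for illustration; the layer internals are an assumption, but the `x + layer(x)` structure is the general form:

```python
import numpy as np

def layer(x, W):
    """A toy feed-forward layer (stand-in for attention or MLP sublayers)."""
    return np.tanh(x @ W)

def residual_block(x, W):
    # The residual connection: the block's input is added directly to
    # its output, creating the shortcut path that lets gradients flow
    # through very deep stacks.
    return x + layer(x, W)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
out = residual_block(x, W)
```

Because the addition is fixed, every layer's contribution is stacked with equal weight, which is the behavior AttnRes targets.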

The problem, according to the Kimi team, is that this fixed accumulation approach leads to a phenomenon called PreNorm dilution — where useful representations from earlier layers get washed out as the network grows deeper.

Attention Residuals (AttnRes) solves this by replacing the fixed addition with softmax attention over preceding layer outputs. Instead of passively stacking representations, each layer can selectively pull the most relevant information from any earlier layer using learned, input-dependent weights.
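The paper's exact formulation isn't reproduced here, but the core idea can be sketched as follows. This is a minimal illustration, assuming a single learned query projection (`w_query` is a hypothetical parameter name) and scoring earlier layer outputs against it:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_residual(history, w_query):
    """Sketch of the AttnRes idea: replace the fixed sum over earlier
    layer outputs with an input-dependent softmax mixture.

    history : list of preceding layer outputs, each shape (d,)
    w_query : hypothetical learned projection, shape (d, d)
    """
    H = np.stack(history)        # (L, d): outputs of all preceding layers
    q = H[-1] @ w_query          # query derived from the current representation
    weights = softmax(H @ q)     # one learned, input-dependent weight per layer
    return weights @ H           # weighted mix replaces plain addition

rng = np.random.default_rng(0)
history = [rng.standard_normal(4) for _ in range(3)]
w_query = rng.standard_normal((4, 4))
mixed = attention_residual(history, w_query)
```

The softmax weights are recomputed per input, so each layer can emphasize whichever earlier representation is most relevant rather than accumulating all of them uniformly.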

Measurable Performance Gains

The reported gains are substantial. According to the paper, AttnRes delivers a 1.25x compute advantage over standard transformer architectures and roughly doubles the model capability obtained per unit of training compute.

To address the memory overhead of attending over all preceding layers in very large models, the team also introduced Block AttnRes, which partitions layers into blocks and attends over block-level representations. This variant preserves most of the gains while keeping memory requirements manageable at scale.
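The blocking step can be sketched in a few lines. Mean-pooling within each block is an assumption for illustration; the paper's pooling scheme may differ:

```python
import numpy as np

def block_summaries(history, block_size):
    """Sketch of the Block AttnRes idea: group preceding layer outputs
    into blocks and keep one summary per block, so attention runs over
    L / block_size entries instead of all L layers."""
    H = np.stack(history)               # (L, d)
    L, d = H.shape
    n_blocks = -(-L // block_size)      # ceil division; last block may be short
    summaries = [
        H[b * block_size:(b + 1) * block_size].mean(axis=0)  # assumed pooling
        for b in range(n_blocks)
    ]
    return np.stack(summaries)          # (n_blocks, d)

rng = np.random.default_rng(1)
history = [rng.standard_normal(3) for _ in range(5)]
S = block_summaries(history, block_size=2)
```

Attending over the block summaries rather than every individual layer is what keeps the memory footprint manageable as depth grows.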

Kimi Linear: A Hybrid Architecture

The research also introduced Kimi Linear, a hybrid attention architecture trained on 1.4 trillion tokens that combines the Attention Residuals approach with linear attention mechanisms. The result is 5-6x faster inference compared to standard transformer models of equivalent capability.

Open Source Release

Moonshot AI has open-sourced the implementation on GitHub, making it available for the broader research community to build upon. The paper is available on arXiv (2603.15031).

Implications for the Field

While many recent AI breakthroughs have focused on scaling models larger or training on more data, Attention Residuals takes a different approach — making the architecture itself more efficient. If the technique proves robust across different model sizes and tasks, it could reduce the enormous compute costs that currently limit who can train frontier AI models, potentially shifting the competitive dynamics of the industry.

