Moonshot AI's Kimi team has published a research paper that could reshape how large language models are built from the ground up. The technique, called Attention Residuals, reworks one of the most fundamental components in modern AI architecture — the residual connection — and delivers significant efficiency gains in both training and inference.
The paper, posted to arXiv on March 16, has already drawn attention from the research community and was presented by Moonshot AI founder Yang Zhilin at NVIDIA's GTC 2026 conference in San Jose.
Rethinking Residual Connections
Residual connections have been a core building block in virtually every deep learning model since the mid-2010s. They work by adding a layer's input directly to its output, creating shortcut paths that help gradients flow through very deep networks.
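As a minimal sketch of the standard mechanism being replaced, the snippet below shows a residual (skip) connection around a stand-in layer function `f`; the names here are illustrative, not from the paper:

```python
def f(x):
    # stand-in for any layer transformation (attention, MLP, ...)
    return [2.0 * v for v in x]

def residual_block(x):
    # output = x + f(x): the shortcut path carries the input through
    # unchanged, which helps gradients flow around f in deep networks
    return [xi + fi for xi, fi in zip(x, f(x))]

print(residual_block([1.0, -0.5]))  # [3.0, -1.5]
```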
The problem, according to the Kimi team, is that this fixed accumulation approach leads to a phenomenon called PreNorm dilution — where useful representations from earlier layers get washed out as the network grows deeper.
Attention Residuals (AttnRes) solves this by replacing the fixed addition with softmax attention over preceding layer outputs. Instead of passively stacking representations, each layer can selectively pull the most relevant information from any earlier layer using learned, input-dependent weights.
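The core idea can be illustrated with a toy sketch: instead of a fixed sum, the current layer scores all earlier layer outputs and combines them with softmax weights. This is an assumption-laden simplification (the learned query/key projections from the actual architecture are omitted), not Moonshot AI's implementation:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend_over_history(query, history):
    # history: outputs of all preceding layers (one vector each).
    # Score each earlier output by dot product with the current query;
    # a weighted sum then replaces the fixed x + f(x) accumulation.
    scores = [sum(q * h for q, h in zip(query, hvec)) for hvec in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * hvec[i] for w, hvec in zip(weights, history))
            for i in range(dim)]
```

Because the weights are input-dependent, a late layer can lean heavily on one early representation for one input and ignore it for another, which is what counters the dilution effect described above.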
Measurable Performance Gains
The results are substantial. According to the paper, AttnRes delivers a 1.25x compute advantage over standard transformer architectures and yields roughly twice the model capability per unit of training compute.
To address the memory overhead of attending over all preceding layers in very large models, the team also introduced Block AttnRes, which partitions layers into blocks and attends over block-level representations. This variant preserves most of the gains while keeping memory requirements manageable at scale.
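The blocking step might look something like the following sketch, which partitions the stored layer outputs into fixed-size blocks and mean-pools each one; the pooling choice is an assumption here, and the paper's block-level representation may be computed differently:

```python
def mean_pool(vectors):
    # average a list of equal-length vectors element-wise
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def block_summaries(history, block_size):
    # partition earlier layer outputs into blocks and pool each block,
    # so attention cost grows with the number of blocks, not layers
    return [mean_pool(history[i:i + block_size])
            for i in range(0, len(history), block_size)]

print(block_summaries([[1.0], [3.0], [5.0]], block_size=2))  # [[2.0], [5.0]]
```

Attending over these summaries instead of every individual layer keeps the memory footprint proportional to the number of blocks.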
Kimi Linear: A Hybrid Architecture
The research also introduced Kimi Linear, a hybrid attention architecture trained on 1.4 trillion tokens that combines the Attention Residuals approach with linear attention mechanisms. The result is 5-6x faster inference compared to standard transformer models of equivalent capability.
Open Source Release
Moonshot AI has open-sourced the implementation on GitHub, making it available for the broader research community to build upon. The paper is available on arXiv (2603.15031).
Implications for the Field
While many recent AI breakthroughs have focused on scaling models larger or training on more data, Attention Residuals takes a different approach — making the architecture itself more efficient. If the technique proves robust across different model sizes and tasks, it could reduce the enormous compute costs that currently limit who can train frontier AI models, potentially shifting the competitive dynamics of the industry.