OScaR Pushes KV Cache to 2-Bit: 5.3x Less Memory, 4.1x More Throughput, Near-Lossless

Michael Ouroumis · 2 min read
A team led by researchers at Meituan's LongCat division, Tsinghua University and the University of Hong Kong has open-sourced OScaR, a KV cache quantization framework that stores the attention cache at INT2 (2-bit) while staying within roughly two points of full-precision accuracy on LongBench-E, and that cuts decode-time memory by 5.3x versus a BF16 FlashDecoding-v2 baseline. The paper, "OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond" (arXiv:2605.19660), ships with CUDA kernels and is targeted squarely at the memory bottleneck that dominates long-context and multimodal serving.

What OScaR actually does

The KV cache (the stored keys and values every token attends to) grows linearly with context length and, at long contexts, can become the single largest consumer of GPU memory during decode. OScaR (Omni-Scaled Canalized Rotation) attacks the hardest part of compressing it: Token Norm Imbalance, where a handful of tokens carry outsized magnitude and blow up quantization error at low bit-widths.
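To make the linear growth concrete, here is a rough back-of-the-envelope sketch. The model shape (36 layers, 8 KV heads, head dimension 128) is an assumption chosen to resemble a Qwen3-8B-style model with grouped-query attention; it is not taken from the paper, and the helper name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes needed to hold keys and values for one sequence."""
    # The factor 2 covers keys plus values.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return seq_len * values_per_token * bits_per_value // 8


# Assumed model shape (hypothetical, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
GIB = 1024 ** 3

for seq_len in (32 * 1024, 64 * 1024, 128 * 1024):
    size = kv_cache_bytes(seq_len, LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=16)
    print(f"{seq_len:>7} tokens: {size / GIB:.1f} GiB per sequence at 16-bit")
```

Under these assumed dimensions, doubling the context doubles the cache, and a single 128K-token sequence at 16-bit needs about 18 GiB before any batching.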

It's a deliberately minimal, two-step pipeline. Canalized Rotation applies a fast Hadamard transform (O(d log d), not O(d²)) to redistribute outlier channel energy, and Omni-Token Scaling does sequence-level normalization to balance token norms. Keys are quantized per-channel, values per-token, with a 128-token residual window kept at higher precision. Critically, it's training-free and calibration-free — no fine-tuning pass, no dataset curation.
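The two steps can be illustrated with a toy, pure-Python sketch for keys: a Hadamard rotation, a per-token norm rescale, then per-channel 2-bit quantization. This is not the authors' implementation. Dividing each token by its L2 norm is an assumed stand-in for Omni-Token Scaling, all function names are hypothetical, and bit-packing, the 128-token residual window, per-token value quantization and the CUDA kernels are left out.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d); d must be a power of two."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [v * scale for v in out]


def quantize_2bit(group):
    """Asymmetric 2-bit quantization: codes in {0, 1, 2, 3} plus a scale and a minimum."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 for constant groups
    codes = [min(3, max(0, round((v - lo) / scale))) for v in group]
    return codes, scale, lo


def per_channel_roundtrip(rows):
    """Quantize each channel (column) across tokens to 2 bits, then dequantize."""
    cols = []
    for col in zip(*rows):
        codes, scale, lo = quantize_2bit(col)
        cols.append([c * scale + lo for c in codes])
    return [list(row) for row in zip(*cols)]


def sketch_roundtrip(keys):
    """Rotate, balance token norms, quantize per channel, then undo both steps."""
    rotated = [fwht(k) for k in keys]
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rotated]
    balanced = [[v / n for v in r] for r, n in zip(rotated, norms)]
    restored = per_channel_roundtrip(balanced)
    # The orthonormal Hadamard transform is its own inverse. It is undone here
    # only so the error can be measured in the original coordinates.
    return [fwht([v * n for v in r]) for r, n in zip(restored, norms)]


def mean_relative_error(original, approx):
    """Average over tokens of ||original - approx|| / ||original||."""
    errs = []
    for o, a in zip(original, approx):
        num = math.sqrt(sum((x - y) ** 2 for x, y in zip(o, a)))
        den = math.sqrt(sum(x * x for x in o)) or 1.0
        errs.append(num / den)
    return sum(errs) / len(errs)


random.seed(0)
d, tokens = 16, 64
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(tokens)]
keys[0] = [50 * v for v in keys[0]]  # one token with an outsized norm

print("plain per-channel 2-bit:", mean_relative_error(keys, per_channel_roundtrip(keys)))
print("rotate + scale + 2-bit :", mean_relative_error(keys, sketch_roundtrip(keys)))
```

The single high-norm token is the toy version of Token Norm Imbalance. Without rescaling, it stretches every channel's quantization range, so the remaining tokens have to share very few usable levels. The sketch stores one norm per token as extra metadata to avoid that.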

The numbers

Measured against BF16 FlashDecoding-v2 on Qwen3-8B at a 128K context length on an H20 GPU, OScaR reports up to 3.0x decoding speedup, 5.3x memory reduction, and 4.1x higher throughput. On accuracy, it tops every competing 2-bit method on LongBench-E with Llama-3.1-8B: 41.75% versus 40.74% for the second-best method (OTT), against a 43.57% 16-bit baseline. KIVI, QuaRot and TurboQuant+ also trail it. On Needle-in-a-Haystack it actually edges the FP16 baseline, 96.5% to 96.0%. The kernels build on HadaCore and BitDecoding and run on GPU Tensor Cores.
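Going from 16-bit to 2-bit would ideally shrink the cache 8x, so a measured 5.3x implies some overhead. The article does not break that gap down. The sketch below only shows how two common sources of overhead, per-group quantization metadata and a full-precision residual window, pull the ratio below 8x. The group sizes and the 32 bits of metadata per group (a 16-bit scale plus a 16-bit zero-point) are hypothetical and should not be read as OScaR's actual configuration.

```python
def effective_bits(code_bits, group_size, meta_bits_per_group):
    """Average stored bits per cache element, including per-group metadata."""
    return code_bits + meta_bits_per_group / group_size


def compression_ratio(seq_len, residual_tokens, code_bits, group_size,
                      meta_bits_per_group, full_bits=16):
    """Ratio of full-precision cache size to quantized cache size."""
    kept_full = min(residual_tokens, seq_len)  # recent tokens left unquantized
    quantized = seq_len - kept_full
    per_elem = effective_bits(code_bits, group_size, meta_bits_per_group)
    return seq_len * full_bits / (quantized * per_elem + kept_full * full_bits)


# Hypothetical settings: 2-bit codes, 32 bits of metadata per group,
# a 128-token residual window, and a 128K-token context.
for group_size in (32, 64, 128):
    ratio = compression_ratio(128 * 1024, 128, 2, group_size, 32)
    print(f"group size {group_size:>3}: about {ratio:.2f}x smaller than 16-bit")
```

Smaller groups track local value ranges more tightly but pay more metadata per element, which is the usual trade-off behind an effective bit-width above the nominal 2 bits.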

The "and Beyond" in the title matters for builders: the authors validate it not just on text LLMs but on multimodal (LLaVA-v1.6, Qwen3-VL-4B/8B) and omni-modal models (Qwen3-Omni-30B-A3B), where KV pressure is worse.

Why it matters for builders

HBM is scarce and expensive, and at long contexts the KV cache, not the weights, is what caps concurrent sessions and context length on a serving node. A drop-in, training-free method that cuts KV memory roughly fivefold lets you serve longer contexts or larger batches on the same Hopper-class GPUs, directly improving tokens-per-dollar. The fact that this comes out of Meituan's production LongCat team, not a pure-research lab, suggests it was built against real serving economics. With code public, the practical question is integration cost into vLLM/SGLang-style stacks. For anyone running long-context inference at scale, though, 2-bit KV cache just got a credible reference implementation.