A team led by researchers at Meituan's LongCat division, Tsinghua University and the University of Hong Kong has open-sourced OScaR, a KV cache quantization framework that compresses the attention cache to INT2 (2-bit) while staying within roughly two points of the 16-bit baseline on LongBench-E, and cuts decode-time memory by up to 5.3x versus a BF16 FlashDecoding-v2 baseline. The paper, "OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond" (arXiv:2605.19660), ships with CUDA kernels and is targeted squarely at the memory bottleneck that dominates long-context and multimodal serving.
What OScaR actually does
The KV cache — the stored keys and values every token attends to — grows linearly with context length and is the single largest consumer of GPU memory during decode. OScaR (Omni-Scaled Canalized Rotation) attacks the hardest part of compressing it: Token Norm Imbalance, where a handful of tokens carry outsized magnitude and blow up quantization error at low bit-widths.
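To put the linear growth in concrete terms, here is a back-of-the-envelope sketch. The model shape (36 layers, 8 KV heads, head dimension 128, roughly a Qwen3-8B-style configuration) is an assumption used for illustration, and the helper name kv_cache_bytes is hypothetical. The 2-bit figure is an idealized bound: it ignores scales, zero-points and any higher-precision residual window, so measured savings will be smaller.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Size of the KV cache for one sequence, counting keys and values only."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return num_tokens * values_per_token * bits_per_value // 8


# Assumed model shape (illustrative; check the real model config before relying on it).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT = 128 * 1024  # 128K tokens

bf16 = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=16)
int2 = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=2)

print(bf16 / 2**30)  # 18.0  (GiB at 16 bits per value)
print(int2 / 2**30)  # 2.25  (GiB at 2 bits per value, before metadata overhead)
```

Doubling the context doubles both figures, which is why the cache, rather than the weights, becomes the limiting factor at long context.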
It's a deliberately minimal, two-step pipeline. Canalized Rotation applies a fast Hadamard transform (O(d log d), not O(d²)) to redistribute outlier channel energy, and Omni-Token Scaling does sequence-level normalization to balance token norms. Keys are quantized per-channel, values per-token, with a 128-token residual window kept at higher precision. Critically, it's training-free and calibration-free — no fine-tuning pass, no dataset curation.
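The general shape of such a pipeline can be sketched in pure Python. This is a toy illustration and not the authors' CUDA implementation: all function names are hypothetical, plain per-token L2 normalization stands in for Omni-Token Scaling (whose exact form is not described here), values are quantized without rotation for simplicity, and the residual window is omitted.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d); len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_2bit(values):
    """Asymmetric 2-bit quantization of one group: codes in {0, 1, 2, 3}, plus scale and minimum."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 when the group is constant
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]


def quantize_keys(keys):
    """Rotate, normalize per token (assumed stand-in), quantize per channel, then reconstruct."""
    rotated = [fwht(k) for k in keys]
    norms = [math.sqrt(sum(x * x for x in r)) or 1.0 for r in rotated]
    scaled = [[x / n for x in r] for r, n in zip(rotated, norms)]
    channels = list(zip(*scaled))  # per-channel groups: one group per column
    deq_channels = [dequantize(*quantize_2bit(ch)) for ch in channels]
    deq = [list(row) for row in zip(*deq_channels)]
    # The orthonormal Hadamard matrix is its own inverse, so fwht also undoes the rotation.
    return [fwht([x * n for x in row]) for row, n in zip(deq, norms)]


def quantize_values(values):
    """Per-token groups: each token vector gets its own scale and minimum."""
    return [dequantize(*quantize_2bit(v)) for v in values]


# Toy input: 2 tokens x 4 channels, with an outlier in the last channel.
keys = [[1.0, 0.0, 0.0, 8.0], [0.5, -0.5, 0.25, 6.0]]
recon = quantize_keys(keys)
max_err = max(abs(a - b) for k, r in zip(keys, recon) for a, b in zip(k, r))
print(max_err)
```

The rotation is what spreads a single outlier channel across all channels: fwht([1, 0, 0, 0]) yields four equal entries of 0.5, so no single channel dominates the quantization range.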
The numbers
Measured against BF16 FlashDecoding-v2 on Qwen3-8B at a 128K context length on an H20 GPU, OScaR reports up to 3.0x decoding speedup, 5.3x memory reduction, and 4.1x throughput. On accuracy, it leads the competing 2-bit methods on LongBench-E, including KIVI, QuaRot and TurboQuant+: 41.75% on Llama-3.1-8B versus 40.74% for the runner-up (OTT), against a 43.57% 16-bit baseline, a gap of 1.82 points. On Needle-in-a-Haystack it actually edges the FP16 baseline, 96.5% to 96.0%. The kernels build on HadaCore and BitDecoding and run on GPU Tensor Cores.
The "and Beyond" in the title matters for builders: the authors validate it not just on text LLMs but on multimodal (LLaVA-v1.6, Qwen3-VL-4B/8B) and omni-modal models (Qwen3-Omni-30B-A3B), where KV pressure is worse.
Why it matters for builders
HBM is scarce and expensive, and KV cache, not weights, is what caps concurrent sessions and context length on a serving node. A drop-in, training-free method that cuts decode-time memory by up to 5.3x lets you serve longer contexts or fatter batches on the same Hopper-class GPUs, directly improving tokens-per-dollar. The fact that this comes out of Meituan's production LongCat team, not a pure-research lab, suggests it was built against real serving economics. With code public, the practical question is integration cost into vLLM/SGLang-style stacks, but for anyone running long-context inference at scale, 2-bit KV cache just got a credible reference implementation.
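A rough capacity estimate shows how that translates into concurrent sessions. Every input below is an assumption for illustration: the KV budget is hypothetical, the per-session figure assumes an 8B-class model at 128K tokens in BF16, and the reported "up to 5.3x" memory reduction is applied directly to per-session KV size, which a real deployment may not achieve.

```python
import math


def max_sessions(kv_budget_gib, kv_gib_per_session, reduction=1.0):
    """Whole sessions that fit when each session's KV cache shrinks by `reduction`."""
    return math.floor(kv_budget_gib * reduction / kv_gib_per_session)


KV_BUDGET_GIB = 80.0       # hypothetical HBM left for KV cache after weights and activations
KV_GIB_PER_SESSION = 18.0  # assumed BF16 KV cache for one 128K-token session
REPORTED_REDUCTION = 5.3   # "up to" figure from the article, applied to KV size as an assumption

print(max_sessions(KV_BUDGET_GIB, KV_GIB_PER_SESSION))                      # 4
print(max_sessions(KV_BUDGET_GIB, KV_GIB_PER_SESSION, REPORTED_REDUCTION))  # 23
```

Under these assumptions the same GPU goes from 4 to 23 full-length sessions, which is where the tokens-per-dollar argument comes from.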


