What memory and throughput gains does OScaR deliver?

Against a BF16 FlashDecoding-v2 baseline on Qwen3-8B at 128K context on an H20 GPU, the paper reports up to a 3.0x decoding speedup, a 5.3x reduction in KV cache memory footprint, and a 4.1x throughput increase.

Does the INT2 quantization require retraining or calibration?

No. OScaR is training-free and calibration-free, using just two operations — Canalized Rotation (a Hadamard transform) plus Omni-Token Scaling — and ships optimized CUDA kernels built on HadaCore and BitDecoding with Tensor Core acceleration.

How much accuracy is lost at 2-bit?

Near-lossless in the reported tests, though not strictly lossless. On LongBench-E with Llama-3.1-8B it scored 41.75% versus a 43.57% 16-bit baseline, a gap of about 1.8 points, and beat all other 2-bit methods. On Needle-in-a-Haystack it hit 96.5% retrieval versus the 96.0% FP16 baseline.

OScaR Pushes KV Cache to 2-Bit: 5.3x Less Memory, 4.1x More Throughput, Near-Lossless

A team led by researchers at Meituan's LongCat division, Tsinghua University and the University of Hong Kong has open-sourced OScaR, a KV cache quantization framework that quantizes the cache to INT2 (2-bit) while staying within about two points of full-precision accuracy on LongBench-E, and that cuts decode-time memory by 5.3x versus a BF16 FlashDecoding-v2 baseline. The paper, "OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond" (arXiv:2605.19660), ships with CUDA kernels and is targeted squarely at the memory bottleneck that dominates long-context and multimodal serving.

What OScaR actually does

The KV cache — the stored keys and values every token attends to — grows linearly with context length and is the single largest consumer of GPU memory during decode. OScaR (Omni-Scaled Canalized Rotation) attacks the hardest part of compressing it: Token Norm Imbalance, where a handful of tokens carry outsized magnitude and blow up quantization error at low bit-widths.
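To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size at 16-bit versus 2-bit. The layer count, KV head count and head dimension are assumptions chosen to resemble an 8B-class model with grouped-query attention; they are not taken from the paper, so check the config of the model you actually serve.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bits_per_element, batch_size=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # The leading 2 accounts for storing both K and V.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element / 8


# Assumed model shape (illustrative, not from the paper).
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128)
seq_len = 128 * 1024  # 128K tokens

bf16_gib = kv_cache_bytes(seq_len, bits_per_element=16, **shape) / 2**30
int2_gib = kv_cache_bytes(seq_len, bits_per_element=2, **shape) / 2**30

print(f"16-bit KV cache: {bf16_gib:.2f} GiB")
print(f" 2-bit KV cache: {int2_gib:.2f} GiB")
print(f"ideal ratio:     {bf16_gib / int2_gib:.1f}x")
```

Under these assumed values a single 128K-token sequence needs 18 GiB at 16-bit and 2.25 GiB at 2-bit, an ideal 8x ratio. The calculation ignores quantization metadata (scales and zero points) and any tokens kept at higher precision, which may be part of why a measured reduction would land below 8x.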

It's a deliberately minimal, two-step pipeline. Canalized Rotation applies a fast Hadamard transform (O(d log d), not O(d²)) to redistribute outlier channel energy, and Omni-Token Scaling does sequence-level normalization to balance token norms. Keys are quantized per-channel, values per-token, with a 128-token residual window kept at higher precision. Critically, it's training-free and calibration-free — no fine-tuning pass, no dataset curation.
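The pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the token scaling is assumed to be a simple division by each token's L2 norm (the paper's exact Omni-Token Scaling formula may differ), and the quantizer is a generic asymmetric min/max scheme. Only the key path is shown; value quantization, bit packing and the 128-token residual window are left out.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform in O(d log d)."""
    out = list(vec)
    d = len(out)
    assert d > 0 and d & (d - 1) == 0, "dimension must be a power of two"
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(d)
    return [x / norm for x in out]


def quantize_keys_int2(keys):
    """Rotate, rescale each token to unit norm, then quantize per channel."""
    rotated = [fwht(k) for k in keys]
    # Assumption: token scaling is plain L2 normalization per token.
    token_norms = [math.sqrt(sum(x * x for x in k)) or 1.0 for k in rotated]
    scaled = [[x / n for x in k] for k, n in zip(rotated, token_norms)]

    num_channels = len(scaled[0])
    codes = [[0] * num_channels for _ in scaled]
    channel_params = []
    for c in range(num_channels):
        column = [tok[c] for tok in scaled]
        lo, hi = min(column), max(column)
        step = (hi - lo) / 3 or 1.0  # 2 bits -> 4 levels -> 3 intervals
        channel_params.append((lo, step))
        for t, x in enumerate(column):
            codes[t][c] = min(3, max(0, round((x - lo) / step)))
    return codes, channel_params, token_norms


def dequantize_keys(codes, channel_params, token_norms):
    """Undo quantization, token scaling and rotation, in that order."""
    restored = []
    for row, n in zip(codes, token_norms):
        scaled = [q * step + lo for q, (lo, step) in zip(row, channel_params)]
        # The orthonormal Hadamard transform is its own inverse.
        restored.append(fwht([x * n for x in scaled]))
    return restored


random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
keys[3] = [20 * x for x in keys[3]]  # one token with an outsized norm

codes, params, norms = quantize_keys_int2(keys)
restored = dequantize_keys(codes, params, norms)
```

The design point the sketch tries to show is ordering: because the per-channel min/max range is shared across all tokens, normalizing token norms first keeps one large token (here, token 3) from stretching that range and crushing every other token into the same one or two levels.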

The numbers

Measured against BF16 FlashDecoding-v2 on Qwen3-8B at a 128K context length on an H20 GPU, OScaR reports up to 3.0x decoding speedup, 5.3x memory reduction, and 4.1x throughput. On accuracy, it tops every competing 2-bit method on LongBench-E — 41.75% versus the second-best (OTT) at 40.74% on Llama-3.1-8B, against a 43.57% 16-bit baseline — while beating KIVI, QuaRot and TurboQuant+. On Needle-in-a-Haystack it actually edges the FP16 baseline, 96.5% to 96.0%. The kernels build on HadaCore and BitDecoding and run on GPU Tensor Cores.

The "and Beyond" in the title matters for builders: the authors validate it not just on text LLMs but on multimodal (LLaVA-v1.6, Qwen3-VL-4B/8B) and omni-modal models (Qwen3-Omni-30B-A3B), where KV pressure is worse.

Why it matters for builders

HBM is scarce and expensive, and the KV cache, not the weights, is what caps concurrent sessions and context length on a serving node. A drop-in, training-free method that cuts KV cache memory by roughly 5x lets you serve longer contexts or larger batches on the same Hopper-class GPUs, directly improving tokens-per-dollar. That this comes out of Meituan's production LongCat team, not a pure-research lab, suggests it was built against real serving economics. With the code public, the practical question is the cost of integrating it into vLLM/SGLang-style stacks. For anyone running long-context inference at scale, though, 2-bit KV cache now has a credible reference implementation.
