Google has published TurboQuant, a compression algorithm that reduces LLM key-value cache memory by at least 6x while delivering up to 8x inference speedup on NVIDIA H100 GPUs, all without any measurable loss in model accuracy. The research, set to be presented at ICLR 2026, has already sent shockwaves through financial markets, dragging down memory chip stocks across the board.
How TurboQuant Works
The key-value (KV) cache is one of the costliest bottlenecks in running large language models. It stores context information so the model doesn't have to recompute it with every new token it generates. As context windows grow larger, the KV cache's memory requirement explodes, driving up hardware costs.
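To see why the KV cache dominates memory at long context lengths, a back-of-the-envelope calculation helps. The sketch below uses illustrative model dimensions (layer and head counts are assumptions, not tied to any specific model) and compares 16-bit storage against the 3-bit budget the article describes:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bits_per_value=16):
    """Rough KV cache size: keys and values (factor of 2) are stored
    for every layer, KV head, and head dimension, per token."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return seq_len * values_per_token * bits_per_value / 8

# 128K-token context, hypothetical model dims
fp16 = kv_cache_bytes(128_000, bits_per_value=16)
q3 = kv_cache_bytes(128_000, bits_per_value=3)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB")
```

At these example dimensions, dropping from 16 bits to 3 bits per value shrinks the cache by a factor of 16/3, or roughly 5.3x before any further savings from the radius/angle decomposition.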
TurboQuant tackles this in two stages. The first stage uses PolarQuant, a method that converts high-dimensional data vectors from standard Cartesian coordinates into polar coordinates: a single radius plus a set of angles. Separating magnitude from direction makes the data far easier to compress, because the angles are bounded and can be encoded with very few bits.
The second stage spends just one additional bit per value, using the QJL algorithm to eliminate quantization bias and produce more accurate attention scores. Together, these steps compress the KV cache to just 3 bits per value, down from the standard 16, without requiring any model training or fine-tuning.
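The sign-bit idea behind QJL-style quantization can be sketched as follows. A key is projected by a random Gaussian matrix and only the signs are kept, plus the key's norm; a known rescaling constant then debiases the estimated attention score. This is a generic illustration of the technique, not the paper's exact kernel, and all dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def qjl_encode(k, S):
    """Compress a key to the sign bits of a random projection,
    plus its norm (the only full-precision scalar kept)."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner(q, bits, k_norm, S):
    """Debiased estimate of <q, k> from sign bits. For Gaussian rows g,
    E[sign(<g, k>) * <g, q>] = sqrt(2/pi) * <q, k/||k||>, so rescaling
    by sqrt(pi/2) * ||k|| / m removes the bias."""
    m = S.shape[0]
    return float(np.sqrt(np.pi / 2) * k_norm / m * (bits @ (S @ q)))

# toy demo: with a wide projection, the estimate tracks the true score
d, m = 16, 20_000
S = rng.standard_normal((m, d))
q, k = rng.standard_normal(d), rng.standard_normal(d)
bits, k_norm = qjl_encode(k, S)
est, true = qjl_inner(q, bits, k_norm, S), float(q @ k)
```

The unbiasedness is the point the article highlights: a biased quantizer systematically distorts attention scores, while a debiased one lets errors average out across the projection dimension.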
Market Shockwaves
The announcement rattled memory and storage stocks. SanDisk Corporation fell 5.7%, SK Hynix slid 5.9%, Samsung dropped 4.8%, Western Digital declined 4.7%, Seagate slipped 4%, and Micron Technology fell 3%. Investors interpreted the breakthrough as a potential threat to memory hardware demand in AI data centers.
However, analysts urged caution. The demand picture for AI memory remains strong, and compression algorithms have existed for years without fundamentally altering procurement volumes. The real impact, some argued, is on the cost curve rather than on total memory shipments.
Beyond LLM Inference
TurboQuant's applications extend beyond language model inference. In vector search workloads, indexing time drops to virtually zero: 0.0013 seconds for 1,536-dimensional vectors, compared to 239.75 seconds for conventional product quantization. This could significantly reduce the cost of retrieval-augmented generation (RAG) pipelines and embedding-based search.
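The near-zero indexing time makes sense for any data-oblivious quantizer: unlike product quantization, which must train codebooks with k-means over the corpus, a random-projection scheme "indexes" with a single matrix multiply and a sign. A minimal sketch, with illustrative code sizes that are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1536, 256  # embedding dim; bits per code (illustrative)
S = rng.standard_normal((m, d)) / np.sqrt(d)  # data-oblivious projection

def index_vectors(X):
    """'Indexing' is one pass over the data: project and take signs.
    No codebook training, so cost is a single matmul."""
    return np.sign(X @ S.T)

X = rng.standard_normal((500, d))  # pretend corpus of 500 embeddings
codes = index_vectors(X)           # 256 sign bits per vector
```

Because the projection is fixed in advance, new vectors can be indexed on arrival, which is what makes this attractive for RAG pipelines where the corpus changes frequently.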
Community Response
Google has not yet released official code, but independent developers have already built working implementations from the paper's mathematics, including versions in PyTorch, MLX for Apple Silicon, and C/CUDA for llama.cpp. TechCrunch noted that the internet has already dubbed TurboQuant the real-life "Pied Piper" — a reference to the fictional compression company from Silicon Valley.
PolarQuant is separately scheduled for presentation at AISTATS 2026, and Google validated the combined approach across multiple benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval.