Google's TurboQuant Crashed the AI Chip Market
TL;DR
Google’s TurboQuant claims a rare trifecta: 6x less KV-cache memory, up to 8x faster cache ops, and zero accuracy loss — Wes says that combination is what made this feel different from the usual “faster but worse” compression story, especially since Google tested it on Gemma, Mistral, and Llama running on Nvidia H100s.
The core trick is PolarQuant, which represents each memory vector as one direct pointer rather than block-by-block directions — Wes uses Google’s own analogy: instead of “go 3 blocks east and 4 north,” the model stores “go 5 blocks at a 37° angle,” preserving strength as the radius and meaning as the direction.
TurboQuant is inference software, not a new chip or retraining method, so the payoff could be immediate — according to Wes, enterprises running LLMs at scale could see something like a 50% cost reduction on production inference, cheaper API calls, and larger context windows on the same hardware.
The market instantly treated this like a threat to AI memory demand, hammering chip stocks — he cites SK Hynix down 6%, Samsung down 5%, SanDisk down 5.7%, Western Digital down 4.7%, and Micron down 3% after the news, feeding the “this crashes Nvidia’s memory business” narrative.
Wes thinks the better frame is Jevons paradox, not collapsing chip demand — if inference gets cheaper and context gets longer, companies won’t politely buy fewer GPUs; they’ll run more models, more agents, and much weirder use cases, the same way cheaper gas makes people drive more, not less.
His closing meta-point is that Google keeps publishing foundational work the whole ecosystem benefits from — he ties TurboQuant back to “Attention Is All You Need,” arguing that many of today’s AI companies exist partly because Google openly released breakthroughs it could have kept as a private cost advantage.
The Breakdown
The “bloodbath” headline, minus the fake Weissman joke
Wes opens with the big claim: Google’s new TurboQuant cuts KV-cache memory needs by 6x, speeds things up by as much as 8x, and somehow does it with zero accuracy loss. He jokingly tosses in a fake “validated Weissman score of 5.2” and catches himself with a Silicon Valley TV-show gag before getting serious: the stock market reaction was real, and memory-chip names got smoked.
Why KV cache matters in the first place
To explain what Google actually did, he walks through how transformers figure out what a word like “it” means only from context. His “the animal didn’t cross the street because it was too tired” example is the human version of the KV cache: the model stores little labeled folders so it can later retrieve the deep relational meaning without rereading everything.
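The labeled-folders idea maps onto attention’s key-value lookup. Here is a minimal toy sketch of a KV cache — the names (`cache_token`, `attend`) and tiny sizes are illustrative assumptions, not Google’s implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # tiny embedding size, purely for illustration

# The "folders": each past token stores a key (what it's about)
# and a value (its meaning), so a later token like "it" can look
# back without rereading the whole sentence.
kv_cache = {"keys": [], "values": []}

def cache_token(key, value):
    """Store one token's key/value pair, as the model does while reading."""
    kv_cache["keys"].append(key)
    kv_cache["values"].append(value)

def attend(query):
    """Softmax-weighted lookup over cached keys -> blended context value."""
    keys = np.stack(kv_cache["keys"])
    values = np.stack(kv_cache["values"])
    scores = keys @ query                   # similarity of query to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax attention weights
    return weights @ values                 # weighted mix of cached meanings

# Cache a few "tokens", then resolve a later query against them.
for _ in range(3):
    cache_token(rng.normal(size=d), rng.normal(size=d))
context = attend(rng.normal(size=d))
print(context.shape)  # one blended d-dimensional context vector
```

This cache is exactly what grows with context length — which is why compressing it is where TurboQuant’s savings come from.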
PolarQuant: from street directions to pointing at the building
This is the key analogy of the video. Instead of describing a memory vector in standard Cartesian steps — go this much across, this much up — PolarQuant represents it as an angle and a distance, like pointing directly at the target, which Google says makes the data more compressible and predictable.
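The street-directions analogy is just the Cartesian-to-polar conversion. A minimal sketch (illustrative only, not TurboQuant’s actual code), using a compass-style bearing so the numbers match the “5 blocks at 37°” example:

```python
import math

def to_polar(east, north):
    """Convert street directions (east, north) into a bearing-style
    polar form: total distance plus angle measured clockwise from north."""
    radius = math.hypot(east, north)                 # "how many blocks total"
    bearing = math.degrees(math.atan2(east, north))  # "at what angle"
    return radius, bearing

# "Go 3 blocks east and 4 north" becomes "go 5 blocks at a 37-degree angle".
r, theta = to_polar(3, 4)
print(round(r), round(theta))  # 5 37
```

The radius carries the vector’s strength and the angle its direction — the two quantities the paragraph above says PolarQuant stores instead of the raw coordinates.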
The grandma chart and the joke that only works with memory
Wes turns the math into a semantic map: gender on one axis, age on another, so “grandmother” and “grandparent” become positions you can almost infer by where he points. He circles back to Google’s line, “a new angle on compression,” and points out that the joke only lands because your brain retrieves the earlier context — basically its own KV cache.
The numbers that make this more than a neat paper
On open models like Gemma, Mistral, and Llama, running on Nvidia H100s, Google reports a 6x KV-cache memory reduction and an 8x speedup for that retrieval process. Wes stresses that this is not “the whole model is 8x faster,” but it’s still a huge deal because the expensive part got dramatically cheaper without the usual JPEG-style tradeoff in quality.
Why he thinks this could cut inference costs in half
His practical takeaway is that this looks like production leverage, not lab-only magic: no retraining, no fine-tuning, just swap in the method and get cheaper inference. He frames it as roughly a 50% cost reduction for enterprises, plus longer context windows, more requests per second, and more headroom for agents, long documents, and large codebases.
The second ingredient: QJL as the tiny error corrector
Wes says “TurboQuant” is the umbrella, with PolarQuant doing most of the compression heavy lifting and a quantized Johnson-Lindenstrauss step cleaning up whatever tiny error remains. That second piece is what helps preserve the headline-grabbing “zero accuracy loss” result rather than merely “close enough.”
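The classic (unquantized) Johnson-Lindenstrauss idea behind that step is a random projection that approximately preserves distances in far fewer dimensions. A hedged sketch of just that core idea — the dimensions and the final sign-quantization line are illustrative assumptions, and the paper’s QJL specifics are more involved:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k = 256, 64  # original and projected dimensions (illustrative sizes)

# Johnson-Lindenstrauss: a random Gaussian projection approximately
# preserves pairwise distances; scaling by 1/sqrt(k) keeps norms in range.
P = rng.normal(size=(k, d)) / np.sqrt(k)

x = rng.normal(size=d)
y = rng.normal(size=d)

orig = np.linalg.norm(x - y)
proj = np.linalg.norm(P @ x - P @ y)
print(f"distance ratio after projection: {proj / orig:.2f}")  # typically near 1

# A "quantized" JL variant then stores the projection very coarsely,
# e.g. as signs (1 bit per coordinate) — shown here only as a gesture
# at the idea, not as the paper's actual scheme.
signs = np.sign(P @ x)
```

The point of the sketch: random projections keep the geometry intact well enough that aggressive quantization on top of them loses very little — which is the role Wes describes QJL playing as the error cleanup behind the “zero accuracy loss” claim.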
The stock crash take versus the Jevons paradox take
After the announcement, SK Hynix, Samsung, SanDisk, Western Digital, and Micron all dropped as traders assumed lower memory needs mean lower chip demand. Wes isn’t buying that straight-line logic: his view is classic Jevons paradox — make inference cheaper and people will flood the system with more models, more tokens, more agents, and stranger use cases, not fewer.
Why Google may be the biggest winner — and why users win too
He argues this is especially great for Google because every efficiency gain on its massive infrastructure drops straight to margin, much like the earlier DeepSeek moment reset assumptions. But he ends on a broader point: users and power users benefit too, because if token costs fall, companies can subsidize more usage, and the real unlock is that Google published the work at all, just as it did with “Attention Is All You Need.”