Discussion (8 Comments), originally posted on HackerNews
We were the first to introduce post-rotation, distribution-aware quantization in 2021, which was LATER implemented in many fields including federated learning, vector retrieval, databases, inference engines, and KV-cache.
It would be nice to get some credit for this. And it is certainly baffling to see the name "TurboQuant" repeated in this context, considering the many works from 2021 onwards.
The blog post above basically walks you through EDEN quantization, but then ends up settling for a less-than-optimal MSE-minimizing version and an unbiasing trick that often costs a full bit more than DRIVE/EDEN need for the same results (with the unbiasing scale shown in the original 2021 paper).
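To make the distinction in the comment above concrete, here is a minimal sketch of 1-bit post-rotation quantization with two choices of scale. It is not the DRIVE/EDEN authors' code or the blog post's implementation. The function names are hypothetical, a dense random orthogonal matrix stands in for whatever fast rotation a real system would use, and the two scale formulas (`||Rx||_1 / d` for the MSE-minimizing variant, `||x||^2 / ||Rx||_1` for the unbiasing variant) are assumptions about what the comment refers to.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix yields an orthogonal matrix; good enough for a sketch.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def one_bit_quantize(x, rotation, unbiased):
    # Rotate, keep only the sign of each coordinate, and store one scalar scale.
    z = rotation @ x
    signs = np.where(z >= 0, 1.0, -1.0)
    l1 = np.abs(z).sum()
    if unbiased:
        # Assumed unbiasing scale: makes <x_hat, x> equal ||x||^2 exactly.
        scale = (x @ x) / l1
    else:
        # Least-squares scale for the sign vector: minimizes ||x_hat - x||^2.
        scale = l1 / x.size
    return signs, scale

def dequantize(signs, scale, rotation):
    # Undo the rotation (the inverse of an orthogonal matrix is its transpose).
    return scale * (rotation.T @ signs)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
R = random_rotation(x.size, rng)

for unbiased in (False, True):
    signs, scale = one_bit_quantize(x, R, unbiased)
    x_hat = dequantize(signs, scale, R)
    err = np.sum((x_hat - x) ** 2) / np.sum(x ** 2)
    print(f"unbiased={unbiased}: relative squared error = {err:.3f}")
```

For a fixed rotation, the MSE scale always gives the smaller error on that single vector. The unbiasing scale accepts a somewhat larger per-vector error so that, under the stated assumption, errors average out across many independently rotated vectors.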
Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn't let you fit a larger model. In some sense that puts local LLM usage at a further disadvantage relative to inference done in a hyperscaler's data center.
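The arithmetic behind that point can be sketched as follows. All numbers are hypothetical (a 32-layer model with 8 KV heads of dimension 128, an 8192-token context, 16 GiB of weights, and a 24 GiB memory budget), and the function names are made up for illustration. The sketch uses the usual estimate of two cached tensors (keys and values) per layer and ignores activations and runtime overhead.

```python
GIB = 1024 ** 3

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    # Keys and values: 2 tensors per layer, each kv_heads * head_dim per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

def max_sequences(budget_bytes, weight_bytes, per_seq_bytes):
    # Whatever memory the weights leave free is shared among concurrent sequences.
    return (budget_bytes - weight_bytes) // per_seq_bytes

budget = 24 * GIB    # hypothetical accelerator memory
weights = 16 * GIB   # hypothetical model weights, unaffected by KV quantization

for bits in (16, 4):
    per_seq = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                             seq_len=8192, bits=bits)
    n = max_sequences(budget, weights, per_seq)
    print(f"{bits}-bit KV cache: {per_seq / GIB:.2f} GiB per sequence, "
          f"{n} concurrent sequences")
```

Under these assumptions, quantizing the cache from 16 to 4 bits quadruples the number of concurrent sequences, while the 16 GiB taken by the weights does not change. A single local user gains longer context per sequence, not room for a bigger model.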