
Discussion (46 Comments) | Read Original on HackerNews
A tiny deterministic model predicts the K/V cache, the prediction is compared with reality, and only the delta is stored in VRAM. Going the other way, you just predict the values again, apply the delta, and you have the full correct value while storing only the delta.
And this works because you're never looking at the whole K/V cache, always just a slice. So you only need a memory buffer the size of the slice.
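A minimal sketch of the predict, diff, and reconstruct loop described above, not the paper's actual method. Small integers stand in for quantized cache values, and `toy_predictor`, `encode_deltas`, `decode_slice`, and `make_toy_cache` are hypothetical names invented for illustration. The only property it relies on is that the predictor is deterministic, so the decoder can regenerate the same prediction for any slice.

```python
from typing import List

Row = List[int]


def toy_predictor(position: int, width: int) -> Row:
    """Hypothetical stand-in for the tiny deterministic model.

    Same input always gives the same output, which is what lets the
    decoder regenerate the prediction instead of storing it.
    """
    return [(position * 3 + i * 7) % 50 for i in range(width)]


def encode_deltas(actual: List[Row]) -> List[Row]:
    """Store only (actual - predicted) for every cached position."""
    deltas = []
    for pos, row in enumerate(actual):
        predicted = toy_predictor(pos, len(row))
        deltas.append([a - p for a, p in zip(row, predicted)])
    return deltas


def decode_slice(deltas: List[Row], start: int, stop: int) -> List[Row]:
    """Rebuild only the requested slice: predict again, then add the delta."""
    out = []
    for pos in range(start, stop):
        predicted = toy_predictor(pos, len(deltas[pos]))
        out.append([p + d for p, d in zip(predicted, deltas[pos])])
    return out


def make_toy_cache(n_tokens: int, width: int) -> List[Row]:
    """Toy 'reality': the prediction plus a small error in {-1, 0, 1}."""
    return [
        [v + ((pos + i) % 3) - 1 for i, v in enumerate(toy_predictor(pos, width))]
        for pos in range(n_tokens)
    ]
```

Reconstruction is exact here because the deltas are stored losslessly. Any memory saving would come from the deltas being small enough to store in fewer bits than the raw values, which this sketch does not attempt to show.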
___
If this works out and I've understood correctly, that _I think_ would mean that a 24GB RTX 4090 could fit 256k q8 context next to Qwen3.6-27B at IQ4_NL.
Or, alternatively, something like 208k context (matching the Claude API limit of 200k on some plans) with a slightly larger quant like UD-Q4_K_XL.
That would be massive. Especially since the thing has so much compute to spare.
Though, all depending on the size of that predictor model I guess?
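For a rough feel of the numbers involved, here is a back-of-the-envelope calculator for an uncompressed KV cache under standard attention. The layer count, KV head count, and head dimension in the test values are made-up placeholders, not the real configuration of any model mentioned in this thread, and `kv_cache_bytes` is a hypothetical helper. It also ignores quantization block overhead and any layers that do not keep a per-token KV cache.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_value: float,
) -> float:
    """Size of an uncompressed KV cache for standard attention.

    The factor of 2 accounts for storing both keys and values.
    All arguments are assumptions supplied by the caller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30
```

With the placeholder values 16 layers, 4 KV heads, head dimension 256, 262,144 tokens, and 1 byte per value, this comes to 8 GiB. Whether a given model plus context actually fits in 24GB depends on its real configuration and on how large the predictor model is.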
KV can be trivially stored in RAM or even on a spinning disk and retrieved on the order of milliseconds. See LMCache for vLLM, for example. In fact it’s so easy that it kinda shocks me when Claude Code sits and recomputes my entire KV cache on a new session after a couple of hours. I guess Anthropic's infra is not as optimized as it would seem.
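To make the "just park it on disk" idea concrete, here is a toy prefix-keyed store using only the Python standard library. `DiskKVStore` is a hypothetical class written for this illustration; it is not the LMCache or vLLM API, and a real system would store tensors rather than pickled Python objects.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, List, Optional


class DiskKVStore:
    """Hypothetical store that keys a KV blob by the exact token prefix."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, token_ids: List[int]) -> Path:
        # Hash the token prefix so identical prefixes map to the same file.
        key = hashlib.sha256(repr(token_ids).encode("utf-8")).hexdigest()
        return self.root / f"{key}.kv"

    def put(self, token_ids: List[int], kv_blob: Any) -> None:
        """Write the KV blob for this prefix to disk."""
        self._path(token_ids).write_bytes(pickle.dumps(kv_blob))

    def get(self, token_ids: List[int]) -> Optional[Any]:
        """Return the stored blob, or None if this prefix was never cached."""
        path = self._path(token_ids)
        if not path.exists():
            return None
        return pickle.loads(path.read_bytes())


if __name__ == "__main__":
    store = DiskKVStore(Path(tempfile.mkdtemp()))
    store.put([1, 2, 3], {"k": [0.1, 0.2], "v": [0.3, 0.4]})
    print(store.get([1, 2, 3]))
```

On a miss the caller would fall back to recomputing the prefix, which is the cost the comment above is complaining about.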
Think about the problem from first principles:
Storing a few GB per user at scale isn’t that hard and was solved years ago. Let’s say I have 20 chat sessions open and the session persists for a day or two, this seems negligible to me as a systems design problem.
I guess for a model with 300B parameters or more and a couple million users, with the price of storage increasing as part of "RAMageddon", this is also not viable...
While this might seem to be true for casual users, I recall that one of the reasons for Anthropic's recent change to retain the KV cache for only an hour or so was that many users just have one massive ongoing session that they continue with multiple unrelated queries (as one would in a single-thread "group chat"). And this is hard to distinguish from someone who wants that context for their seemingly-unrelated query to apply tone etc.
So in practice, there are many casual users who are typing their Google-esque searches against a 100k+ token context window - and it's at that point where things balloon into 300GB+ KV caches to maintain.
I wouldn't be surprised if we see new UXs around subsidized plans starting to encourage resetting the context window more often.
The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.
For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.
MoE is more hardcoded and predetermined; speculation is much more dynamic and malleable after training.
This paper actually proposes aligning the architecture to aid speculation as a direction for future work.