Discussion (4 Comments, originally posted on HackerNews)
For example, `--cpu-moe` keeps the MoE expert layers on the CPU instead of offloading them to the GPU. That drops performance by about a quarter, but only the dense and important layers stay on the GPU, so you can run MoE models bigger than VRAM almost for free and also free up room in VRAM for more KV cache. It does nothing on CPU-only.
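To make the VRAM trade-off concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up placeholder rather than a measurement of any real model, and `vram_needed_gib` is a hypothetical helper, not part of any tool. It only assumes that the expert weights either live in VRAM or stay in system RAM:

```python
def vram_needed_gib(dense_gib, expert_gib, kv_cache_gib, cpu_moe):
    """Estimate VRAM use in GiB; with cpu_moe the expert weights stay in system RAM."""
    weights_on_gpu = dense_gib if cpu_moe else dense_gib + expert_gib
    return weights_on_gpu + kv_cache_gib


# Hypothetical sizes, chosen only for illustration (not from any real model).
VRAM_GIB = 24.0     # assumed GPU memory
DENSE_GIB = 6.0     # assumed dense/attention weights
EXPERT_GIB = 40.0   # assumed MoE expert weights
KV_GIB = 4.0        # assumed KV cache at the chosen context length

full = vram_needed_gib(DENSE_GIB, EXPERT_GIB, KV_GIB, cpu_moe=False)
split = vram_needed_gib(DENSE_GIB, EXPERT_GIB, KV_GIB, cpu_moe=True)
headroom = VRAM_GIB - split  # VRAM left over, e.g. for a larger KV cache

print(f"all on GPU:      {full:.1f} GiB (fits: {full <= VRAM_GIB})")
print(f"experts on CPU:  {split:.1f} GiB (fits: {split <= VRAM_GIB})")
print(f"headroom:        {headroom:.1f} GiB")
```

With these placeholder numbers the model only fits once the experts are kept off the GPU, and the leftover headroom is what could go to more KV cache. On a machine with no GPU there is no VRAM budget to split, which is why the flag changes nothing there.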
`--no-kv-offload` also does nothing here: it keeps the KV cache out of VRAM, but he doesn't have a GPU to offload to, and that is already the default on CPU-only.
Again, `-sm` is only for multi-GPU. No GPUs here.
`--mla-use` is for models that use DeepSeek's Multi-Head Latent Attention. Gemma 4 is not one of them.
`--merge-up-gate-experts` reduces matrix math complexity around the ffn_up and ffn_gate tensors; CPUs do not have tensor units, so this is unlikely to actually help.
MTP is also never faster on CPU-only, and this is documented. ngram-mod, however, may help, and it doesn't look like he tried it.
This whole screed also reads like it was written by AI.
> The honest caveat, because it matters:
> This one I got right in the original, and now I have the number to back it.
Thanks Claude.
“The X was Y” for non-trivial concept X, without any previous attempt to introduce or explain what it is. It reads like it’s intentionally trying to bamboozle the reader.