

Discussion (4 Comments), original on HackerNews

DiabloD3, about 2 hours ago
I suspect this person didn't read the README or the `--help` at all.

For example, `--cpu-moe` keeps the MoE layers on the CPU instead of offloading them to the GPU. That drops performance by about a quarter, but only the dense and important layers stay on the GPU, so you can run MoE models bigger than VRAM almost for free and also free up room in VRAM for more KV cache. It does nothing on CPU-only.

`--no-kv-offload` also does nothing here: it keeps the KV cache from being offloaded to VRAM, but he doesn't have a GPU to offload to, and this is already the default there.

Again, `-sm` is only for multi-GPU. No GPUs here.

`--mla-use` is for models that use Deepseek's Multi-Head Latent Attention. Gemma 4 is not one of them.

`--merge-up-gate-experts` reduces matrix math complexity around ffn_up and ffn_gate tensors; CPUs do not have tensor units and this is unlikely to actually help.

MTP is also never faster on CPU-only, and this is documented. ngram-mod, however, may help, and it doesn't look like he tried it.

This whole screed also reads like it was written by AI.
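The flag notes in the comment above can be summarized as a small lookup. This is only a sketch: the requirement labels come from the commenter's claims and were not checked against the tool's `--help`, and the names `FLAG_REQUIREMENTS` and `no_op_flags` are hypothetical.

```python
# Hypothetical lookup built only from the claims in the comment above.
# It is NOT derived from the tool's actual --help output.
FLAG_REQUIREMENTS = {
    "--cpu-moe": "gpu",            # keeps MoE layers off the GPU
    "--no-kv-offload": "gpu",      # keeps the KV cache out of VRAM
    "-sm": "multi_gpu",            # described as multi-GPU only
    "--mla-use": "mla_model",      # only for Multi-Head Latent Attention models
}


def no_op_flags(flags, gpu_count, model_uses_mla):
    """Return the flags that, per the comment, would have no effect."""
    ineffective = []
    for flag in flags:
        requirement = FLAG_REQUIREMENTS.get(flag)
        if requirement == "gpu" and gpu_count < 1:
            ineffective.append(flag)
        elif requirement == "multi_gpu" and gpu_count < 2:
            ineffective.append(flag)
        elif requirement == "mla_model" and not model_uses_mla:
            ineffective.append(flag)
        # Flags missing from the lookup are left alone: nothing is claimed about them.
    return ineffective


# CPU-only setup with a non-MLA model, as described in the comment.
cpu_only = no_op_flags(
    ["--cpu-moe", "--no-kv-offload", "-sm", "--mla-use"],
    gpu_count=0,
    model_uses_mla=False,
)
```

On the CPU-only setup described in the comment, every one of these four flags ends up in the returned list, which is the commenter's point.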

usernamed7, about 3 hours ago
> I am telling you the count because the count is the point.

> The honest caveat, because it matters:

> This one I got right in the original, and now I have the number to back it.

Thanks Claude.

dwroberts, about 1 hour ago
What makes it particularly bad, too, is its habit of saying

“The X was Y”

for a non-trivial concept X, without any previous attempt to introduce or explain what it is. It reads like it's intentionally trying to bamboozle the reader.

mmmpetrichor, about 2 hours ago
AI didn't read. (AIDR?)