ZH version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
78% Positive
Analyzed from 3795 words in the discussion.
Trending Topics
#model#gemma#models#google#https#tokens#qwen#mtp#token#using

Discussion (120 Comments)Read Original on HackerNews
It's not uncommon to see a gemma vs qwen comparison, where qwen does a bit better, but spent 22 minutes on the task, while gemma aligned the buttons wrong, but only spent 4 minutes on the same prompt. So taken at face value, gemma is now under performing leading open models by 5-10%, but doing it in 1/10th the time.
Caveat: Gemini has been dumbed down a few times over the last year. Rate limits tightened up too. So it might not be this good in the future.
Antigravity seems significantly better in comparison, but with lower usage limits. If I run out, I usually don't bother switching to Gemini CLI.
Even with pro, I have caught it going off the rails a few times. The most frustrating was when I asked it to do translations, and it decided there were too many to do so it wrote a python script that ran locally and used some terrible library to do literal translations, and some of them were downright offensive and sexual in nature. For translations though, Gemini is the best but you have to have it do a sentence or two at a time. If you provide the context around the text, it really knocks it out of the park
https://gemini.google/subscriptions/
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
For someone who's been running local models for a long while, these are very very exciting times.
However I find qwen unbeatable for toolcallling. I think gemma wasnt trained on that at all.
Why when asking a model to change text in a minor way; are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the ot on the existing text vs reproducing every token? Maybe tools are doing that more than I realize?
The current implementation ignores that head but the PR let the tool recognize it, plus does proper integration (run the MTP while running the slower main path then compare the result, I believe.)
However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon. My build doesn't support any more GPUs and I believe I would want another 4090 (overpriced) for best performance or otherwise just replace it altogether.
best is to use your own model router atm, depending on the task
Some of the work in that direction like Cerebras or Taalas have been doing is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder about what might be possible if even current state of the art models were available at like, a million tokens per second at a very low cost.
Modem vs Claude according to Claude:
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
[0] - https://taalas.com/products/
https://chatjimmy.ai/
They built an entire wafer ASIC. The entire thing is one huge active ASIC. it takes a lot of cool engineering and cooling to make it work, and is very cool.
Focusing more on performance to compute efficiency over pure performance. And maybe that’s why Gemini is (seemingly) lagging behind?
Other providers hitting capacity and hitting the limits subsidising their inference.
Google strategy seems to be about scaling and distributing these models to their existing billions of users.
The general narrative I would read on HN/others, was that Google would be able to outlast/outcompete OpenAI and Anthropic because Google had both more money and more compute. Playing the game of subsidizing their most capable models to capture market share longer than the VCs could.
But instead I feel like Google opted out of that much earlier. Shifting their focus on efficiency and scaling much much earlier. Flash and Gemma being where Google was actually ahead of the competition while everyone was focused on bigger.
My personal opinion was this was necessary because integrating AI into products like AI overview, search meant scaling to billions of users was a requirement right out of the gate. And theres not enough money/compute no matter who you are to use frontier models for that.
I tried first with Qwen but it was unstable and had ridiculously long thinning traces!
Local models are the future it's awesome
Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5 speedup is tantalizing.
Have tried draft models to limited success (the smaller 3B draft model in addition to a dense 14B Ministral model introduced too much overhead already)
For gemma4 26B, same quantization, I get >200TPS.
Also note that qwen is extremely inefficient in reasoning; the reasoning chains are ~3x longer than gemma on average
Not sure why (too amateur sorry).
Though I think qwen was natively trained on toolcalling.
You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
[1] https://github.com/ml-explore/mlx-lm/pull/990
[2] https://github.com/ggml-org/llama.cpp/pull/22673
All 4 gemma-4-*-it models, regardless whether they are dense models or MoE models, have associated small models for MTP, whose names are obtained by adding the "-assistant" suffix.
https://huggingface.co/google/gemma-4-E2B-it-assistant
https://huggingface.co/google/gemma-4-E4B-it-assistant
https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
https://huggingface.co/google/gemma-4-31B-it-assistant
They're somehow connected to vision & block speculative decode...don't ask me how/why though
For gemma specifically had more luck with speculative using the llama-server route than lm studio
You could pair a big and small model like qwen 32b with qwen 4b and had that same dynamic of the small model generating tokens and the big one "certifiying" them.
The blog says something about re-using the big model's data?
Google has now provided small models for each of the previous Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".
The difference vs. Qwen is that here each small model is not some general-purpose smaller model, but a model that has been optimized specifically for this task, to predict the output of the bigger model with which it is paired.
This specialization and optimization of the Google "gemma-4-*-assistant" models ensures that they are much smaller and thus much faster than general-purpose small models.
They serve gemma-4-26b-a4b-it.
I'm not seeing any update to the app on my android phone... maybe later today?
>We’ve published an in-depth technical explainer
I was expected a pdf link, but this goes to a brief article on twitter/X. lol, okay...
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
E4B = 4B effective parameters (using per-layer embeddings)
E2B = 2B (like above)
it = instruction tuned (rlhf and all that jazz)
assistant = Multi-token drafters (the new 2x speed up)
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why cloud offer it - think it's just an effort to promote the brand. The gemmas are pretty small so they can host it without it being a major drain on the company. They have the infra anyway
If Gemma 4 is less lucrative than Claude to the Google Cloud kingdom, the Cloud kingdom will want you using Claude.
Might be easier to chuck it over the fence and let other providers handle it as it'll run in almost any commercial grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini flashlight depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest qwen models on openrouter (27b and 35b) are not at all worth using, as there are way bigger and better models for less oricemon a per token basis
And even alibaba's own qwen3.6-plus is $1.95, so it's kinda easy to come to a conclusion that alibaba (nor anyone else) is really interested in hosting that model.
And don't get me wrong, I fully agree with you, qwen3.6 27b is an amazing model. I run it on my own hardware and every day I'm constantly surprised with what it can zero shot.
I'm curious where my understanding is wrong, but I didn't think you necessarily got the exact same output with how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model produces, they're accepted.
I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot.) Just one the top model could have produced with whatever top-k and temperature settings.
The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.
A poor draft model will simply slow down the process without affecting the output.
This is the crux. What makes the guess "right"?
I think the acceptance criteria is not that the token is exactly the token the big model would have produced. It's accepted of the big model verifies that the probability of that token was high enough.
How close it is to the same output (or same distribution of outputs) you'd get from running the big model would be dependent on temperature, top-k, top-p settings, or other inference parameters.
> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
Beta but useable
plus over time the harness - coming version has a hotkey for screen capture, next release will have support for native excel, docx export
there is value in being offline by design
Gemma:31b was more accurate but speed was horrendous.