Discussion (94 Comments) on Hacker News
On the other hand, I've been using Mimo and Minimax a lot recently. They routinely reach 100-150 tokens per second, and that feels too fast, to the point where it's hard to keep up with what they're actually doing. Great for subagents though.
Calling the token rate the rate at which they "type" is a bit misleading. They also do virtually all of their more complex reasoning in tokens, so 5 tokens per second is also their thinking speed. And thinking at 5 tokens per second is glacially slow.
This is why faster versions of strong models do so well on reasoning tasks like playing text adventure games[1]. Their output isn't better on a token-for-token basis, but they get so much more thinking in during a given time window that they get more opportunities to find the right conclusion.
[1]: https://entropicthoughts.com/updated-llm-benchmark
There is no way you can follow what is going on even at 30 tokens per second. Maybe you can maintain a rough idea of what is going on for some tens of seconds but that is probably about it. Follow it in any detail, no chance. Reason about what you read, absolutely no chance.
800 tok/s — Cerebras-class, where the bottleneck is your eyeballs
I do not understand why they say this. I am not sure if it is even true. 800 tokens sounds like a page of text, and I would assume you can look at one page per second without hitting any limitation of your eyes. Or is the resolution of the human eye not good enough to see an entire page at once, so that you have to scan it with the fovea? Scrolling text might of course hit the temporal resolution limit. But why does this even matter? Your brain cannot process anything close to the amount of information your eyes can take in.
Click on 800.
Try to read the text.
You'll understand.
EDIT: As others have pointed out, and as I have now read up on, it is an illusion that you can see all the text on a page at once; that is beyond the resolution limit of the human eye. To actually see all the words, you have to scan the page, and that takes several seconds. From the numbers I have seen, it seems that the ultimate limit is probably below 30 tokens per second, no matter what, even using rapid serial visual presentation to cut out eye movements. Even 10 to 20 tokens per second is probably pushing it and unsustainable for many, if not most, people.
You would also need to compare token generation not with a person's final output, but with everything behind it: the thoughts and the deleted and edited parts.
100 wpm might still be a bit high even for your average programmer.
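To put words per minute and tokens per second on the same scale, here is a minimal sketch of the conversion. The ratio of 0.75 English words per token is an assumption (a common rule of thumb, not a figure from this thread), and the function names are made up for illustration.

```python
# Assumption, not from the thread: about 0.75 English words per token.
WORDS_PER_TOKEN = 0.75


def wpm_to_tokens_per_second(wpm: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a human typing or reading speed (words/minute) to tokens/second."""
    return wpm / words_per_token / 60.0


def tokens_per_second_to_wpm(tps: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a model's streaming rate (tokens/second) to words/minute."""
    return tps * words_per_token * 60.0


if __name__ == "__main__":
    for wpm in (40, 100, 250):
        print(f"{wpm} wpm is about {wpm_to_tokens_per_second(wpm):.1f} tok/s")
    for tps in (5, 30, 800):
        print(f"{tps} tok/s is about {tokens_per_second_to_wpm(tps):.0f} wpm")
```

Under that assumption, 100 wpm works out to a little over 2 tokens per second, well below even the slowest rates discussed here.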
> Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.
I built something similar recently, for the same reason: https://modal.com/llm-almanac/token-timing-simulator.
I like that the output rendering is closer to typical UIs -- syntax highlighting in code mode, tool calls, dim-italic reasoning.
One feature mine has that the author, or anyone else who vibe codes their own version after seeing this, might like to steal is modeling the distribution of output latencies. My implementation is hacky (log-normal roughly estimated from p50, p90, and p99 values), but still, when you set those to realistic values, it recreates the "jitter" you see in many LLM UIs.
antirez is right that generation tok/s isn't flat as a function of context length, which is a weakness of both simulators.
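For anyone wanting to try the log-normal jitter idea, here is one way it could be sketched. This is not the simulator's actual code: it assumes a least-squares fit of ln(latency) against standard normal quantiles, and the example percentile values in milliseconds are hypothetical.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50: float, p90: float, p99: float) -> tuple[float, float]:
    """Least-squares fit of ln(latency) = mu + sigma * z over three percentiles."""
    zs = [NormalDist().inv_cdf(q) for q in (0.50, 0.90, 0.99)]
    ys = [math.log(v) for v in (p50, p90, p99)]
    z_mean = sum(zs) / len(zs)
    y_mean = sum(ys) / len(ys)
    # Ordinary least squares: slope is sigma, intercept is mu.
    sigma = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys)) / sum(
        (z - z_mean) ** 2 for z in zs
    )
    mu = y_mean - sigma * z_mean
    return mu, sigma


def sample_gaps(mu: float, sigma: float, n: int, seed: int = 0) -> list[float]:
    """Draw n inter-token gaps from the fitted log-normal distribution."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical per-token latencies in milliseconds.
    mu, sigma = fit_lognormal(p50=20.0, p90=45.0, p99=120.0)
    gaps_ms = sample_gaps(mu, sigma, n=10)
    print(f"mu={mu:.3f} sigma={sigma:.3f}")
    print([round(g, 1) for g in gaps_ms])
```

A renderer would sleep for each sampled gap before emitting the next token. Like both simulators, this sketch keeps the distribution fixed regardless of context length.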
Something like https://chatjimmy.ai/ will do 14,000 tokens per second, and it's a completely different experience to what we see now. It's more like a page load than a conversation.
I'd much rather trade speed for accuracy.
This is not a realistic replay of what a common LLM might actually print out - it's entirely fabricated. But for the purpose of estimating the feel of tokens per second, I suppose it's good enough.
[1] https://dave.ly/tools/tokenflow/
[2] https://platform.openai.com/tokenizer
1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.
2. prefill t/s, that is, prompt processing speed.
3. The slope of those two numbers as the context size increases. An implementation that decodes at 50 t/s with 2k context but at 7 t/s with 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.
4. What's your use case? Reading a huge text and then producing a small output like "fraud probability=12%"? Or reading a small question and generating a lot of text? Whether a model is usable at a given prefill/decoding speed depends substantially on this.
For instance, my DS4F inference on the DGX Spark does prefill at 350 t/s, dropping to 200 t/s on already large contexts, but decodes at 13 t/s.
On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.
The two systems can perform dramatically differently or almost the same depending on the use case. In general, for local inference to be acceptable, even if slow, you want at least 100 t/s prefill and at least 10 t/s generation. To be OK-ish, 200 to 400 t/s prefill and 15-25 t/s generation. To be a wonderful experience, thousands of t/s prefill and 100 t/s generation.
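The point about use cases can be made concrete with a rough latency model: total time is prompt tokens divided by prefill rate plus output tokens divided by decode rate. The sketch below uses the prefill and decode figures quoted above; the two workloads are hypothetical, and the model assumes flat rates, so it ignores the slowdown with context size described in point 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    prefill_tps: float  # prompt processing speed, tokens/second
    decode_tps: float   # generation speed, tokens/second

    def seconds(self, prompt_tokens: int, output_tokens: int) -> float:
        """Simplified end-to-end time, assuming rates do not change with context."""
        return prompt_tokens / self.prefill_tps + output_tokens / self.decode_tps


# Rates as quoted in the comment above.
SPARK = Machine("DGX Spark", prefill_tps=350, decode_tps=13)
MAC = Machine("Mac Ultra", prefill_tps=400, decode_tps=35)

# Hypothetical workloads: (prompt tokens, output tokens).
WORKLOADS = {
    "big input, tiny output": (50_000, 20),
    "small question, long answer": (200, 2_000),
}

if __name__ == "__main__":
    for label, (prompt, output) in WORKLOADS.items():
        for machine in (SPARK, MAC):
            t = machine.seconds(prompt, output)
            print(f"{label:28s} {machine.name:10s} {t:7.1f} s")
```

With these inputs the two machines land within about 15% of each other on the prefill-heavy job, while the decode-heavy job takes well over twice as long on the slower decoder.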
You should run a multi-session batched decode on that DGX unless your 13 t/s decode is already running into thermal or power limits, which I don't believe it is. (To be clear, this is a real issue on Apple Silicon machines: batched decode does not seem to unlock higher aggregate tok/s unless you're specifically trying to mitigate the drawbacks of slow streamed inference. Especially on the M5 laptops, thermal/power throttling places an early limit on your total compute.
The jury is still out on Strix Halo, but I think batched decode may turn out to be quite useful there since the bandwidth bottleneck is even more constraining there.)
So you can easily follow the 1000 tokens of code, and the 18000 tokens of thinking is you sitting around waiting for your GPU to process the LLM.
Since the whole goal of software architecture schemes is to allow the rest of us non-geniuses to still understand it and modify it, perhaps the same could be true of LLMs.
Perhaps a million-per-second hypothetical (small) model can be more useful than a state of the art big one.
Maybe when intelligence plateaus it could become a main differentiating factor, like smartphones and battery life.
For non-trivial work I go through hundreds of thousands of tokens (combined prefill + tg of course) before even getting to some useful text output.
I mostly use LLMs for exploration and studies, rarely code generation. Prefill matters heavily for this. Even with prefill rates in the high hundreds or low thousands, I spend a lot of time waiting on the LLM (doing other things, not twiddling thumbs).
One can say the most profound thing in 3 words, slowly or fast it does not even matter, and can also spout absolute senseless garbage in billions of words at absolutely ridiculous speed.
Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single purpose requests) you absolutely can use these models and get insane speeds.
I constantly push Opus and GPT, and they are getting better. But still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!
I use Haskell because I find laziness to be a super power. I can solve so many problems in the most straightforward way, and then laziness saves my butt w.r.t. performance.
I use Haskell because it is a better C than C is. The foreign function interface is brilliant, and I can take C primitives and apply all the abstraction mechanisms from Haskell to them. My latest project has been OpenGL based, so lots of caring about byte alignments and shovelling data to the GPU. But all this can be automated with clever use of type classes and Generics (Haskell's super cool meta system of data types).
I use Haskell because I love applying abstractions to make code which describes the problem, and then the compiler finds the solution.
I don’t do programming for embedded, so I am rarely memory constrained. I also understand Haskell memory usage quite well, and can get myself out of trouble.
It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.
15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem.
Feedback loops for prototyping could become even quicker.
Quality-wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). In terms of token volume for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though; apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.
Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.
But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!
[edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]
This was also what I used at the time, the Qwen 3 Coder 480b on Cerebras. It worked great and was so stupidly fast that it made me realize that if hardware at that level becomes commercially available (say in 5 to 10 years) for that price, then we will have entirely new bottlenecks. Human review at the pace it was going is completely impossible.
So 75 tokens/s is ~300 chars per second, which is the speed you'd get with a 2400 baud modem.
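The modem comparison is simple arithmetic and can be checked with a short sketch. It assumes 4 characters per token (implied by the 75 to 300 figure above) and 8 bits per character with no framing overhead, treating "baud" loosely as bits per second; the function name is made up for illustration.

```python
def tokens_to_bits_per_second(
    tokens_per_second: float,
    chars_per_token: float = 4.0,  # assumption implied by 75 tok/s ~ 300 chars/s
    bits_per_char: int = 8,        # assumption: plain 8-bit characters, no framing
) -> float:
    """Convert a token streaming rate to an equivalent serial line rate."""
    return tokens_per_second * chars_per_token * bits_per_char


if __name__ == "__main__":
    for tps in (5, 30, 75, 800):
        bps = tokens_to_bits_per_second(tps)
        print(f"{tps:4d} tok/s is about {bps:8.0f} bit/s")
    # With start and stop bits (10 bits on the wire per character),
    # the same text rate needs a faster line.
    print(tokens_to_bits_per_second(75, bits_per_char=10))
```

If you count start and stop bits, 300 characters per second needs closer to 3000 bit/s on the wire, so the 2400 figure is a round approximation.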
I don't see a big difference.