Discussion (31 Comments), originally posted on HackerNews
To work in LLM training/inference you’re expected to know this stuff but to know this stuff you need to be working in the space.
I guess the difference here being that we have ample compiler literature and practically know 99% of all there is to know about compilers that exist in the wild vs this new field.
Until we’ve gathered and agreed on a few “dragon books” for LLMs and have explored all there is to LLMs, you’re probably right - know-how will be with the practitioners and in source code until it’s distilled (pun intended).
First, how do you know exactly what the optimal VRAM assignment per model and per context size is? That currently seems to be based purely on experience. Second, how do you make sure that only that amount is available to your infra/containers? That part is being handled by DRA and stuff like https://project-hami.io
This is only tangentially related to the blog post, but the title was picked in such a way that I couldn't help but put the shameless plug here. When he wrote "popping the bubble", I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.
Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.
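For the "how much VRAM per model, per context size" question, a first-order floor can at least be written down, even if the real number still needs tuning by experience. The sketch below is only that: it counts weights plus KV cache and ignores activations, fragmentation, and framework overhead. The layer count, KV head count, and head dimension are made-up illustrative values, not the config of any model mentioned in this thread; only the 9B parameter count is borrowed from the discussion.

```python
def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the model weights (2 bytes per parameter assumes fp16/bf16)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Memory for the KV cache: one K and one V tensor per layer."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical architecture values, chosen only for illustration.
    n_layers, n_kv_heads, head_dim = 32, 8, 128
    weights = weight_bytes(9_000_000_000)
    for context_len in (2048, 8192, 32768):
        kv = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
        total_gib = (weights + kv) / 2**30
        print(f"context={context_len:6d}  floor ~ {total_gib:.1f} GiB")
```

Treat the result as a lower bound to start from; enforcing that only this much is visible to a container is the separate problem the comment points at.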
Can you explain what you mean here? Are you talking about small neural networks doing specific tasks?
In general, for some reason Codex loves CUDA streams; it's the first optimization it goes for every time when writing GPU kernels. However, in many cases this is not a bottleneck. It happens to be one here because the model in the blog is small (a 2.4ms FW-pass is tiny, and 9B params sit on a single GPU). Large models are closer to 30-40ms. The CPU-GPU sync is 1-2ms, so when working on larger MoE models, scheduling tokens this way is much less important than, for example, scheduling of computation/communication or kernel optimization.
I wish the blog would state this at the start, with the premise of what has been done, or show that this is indeed the bottleneck with some benchmarking. Otherwise it is kind of overselling things imo.
If you scroll down to the section titled "A cost model for the bubble", you will find both benchmark results and us saying, "you get back anywhere from a few percent to a third; more the faster your accelerator/model is".
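For anyone who wants to sanity-check the numbers being thrown around here, a back-of-the-envelope sketch: assume each decode step is one forward pass followed by a CPU-GPU sync during which the GPU sits idle, and that the sync is not overlapped with anything. This is not the blog's cost model, just the simplest one that fits the figures quoted in this thread (2.4ms and 30-40ms forward passes, 1-2ms sync). The function name is made up for illustration.

```python
def bubble_fraction(forward_ms: float, sync_ms: float) -> float:
    """Fraction of each step the GPU is idle, assuming the sync is fully exposed."""
    return sync_ms / (forward_ms + sync_ms)


if __name__ == "__main__":
    # Forward-pass and sync times taken from the comments above.
    for forward_ms in (2.4, 30.0, 40.0):
        for sync_ms in (1.0, 2.0):
            frac = bubble_fraction(forward_ms, sync_ms)
            print(f"forward={forward_ms:5.1f}ms  sync={sync_ms:.1f}ms  "
                  f"idle={frac:.1%}")
```

Under these assumptions the small model loses a large share of each step to the sync while the 30-40ms models lose only a few percent, which is consistent with both the "a few percent to a third" quote and the point that the win shrinks as the forward pass gets longer.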
This appears to be different than the recent "Speculative Pipeline Decoding" paper: https://arxiv.org/abs/2605.30852
This is true, but I've never heard anyone refer to this as a GPU bubble before.
I think most people hear "GPU bubble" and think of a financial bubble of some kind.
Very odd, but perhaps more familiar to graphics programmers? I will say I'd probably call it a stall, which is exactly what the Vulkan docs call it moments later, so :shrug:
any time your GPU is idle = you are losing $$$ = your TCO is going up. you don't want that.
Better term, anyone?
The GPU would be the propeller, the influx is the work, and the operational parameters are what this article's about.