
Discussion (24 Comments) · Read Original on HackerNews

nynx · about 9 hours ago
I don’t understand why this is a useful effort. It seems like a solution in search of a problem. It’s going to be incredibly easy to end up with hopelessly inefficient programs that need a full redesign in a normal GPU programming model to be useful.
LegNeato · about 8 hours ago
Founder here.

1. Programming GPUs is a problem. The ratio of CPUs to CPU programmers and GPUs to GPU programmers is massively out of whack. Not because GPU programming is less valuable or lucrative, but because GPUs are weird and the tools are weird.

2. We are more interested in leveraging existing libraries than running existing binaries wholesale (mostly within a warp). But, running GPU-unaware code leaves a lot of space for the compiler to move stuff around and optimize things.

3. The compiler changes are not our product, the GPU apps we are building with them are. So it is in our interest to make the apps very fast.

Anyway, skepticism is understandable and we are well aware code wins arguments.

ghighi7878 · about 5 hours ago
Good point about GPU threads being equivalent to warps.
esperent · 22 minutes ago
Groups of GPU threads are called warps.

https://modal.com/gpu-glossary/device-software/warp

jzombie · about 7 hours ago
Do you foresee this being faster than SIMD for things like cosine similarity? Apologies if I missed that context somewhere.
LegNeato · about 7 hours ago
It depends. At VectorWare we are a bit of an extreme case in that we are inverting the relationship and making the GPU the main loop that calls out to the CPU sparingly. So in that model, yes. If your code runs in a more traditional model (CPU driving and using the GPU as a coprocessor), probably not: going across the bus dominates most workloads. That being said, the traditional wisdom is becoming less relevant as integrated memory pops up everywhere and tech like GPUDirect exists with the right datacenter hardware.

These are the details we intend to insulate people from so they can just write code and have it run fast. There is a reason why abstractions were invented on the CPU and we think we are at that point for the GPU.

(for the datacenter folks I know hardware topology has a HUGE impact that software cannot overcome on its own in many situations)

shmerl · about 8 hours ago
> because GPUs are weird and the tools are weird.

Why is it also that the terminology is so all over the place? Subgroups, wavefronts, warps, etc. all referring to the same concept. That doesn't help either.

adrian_b · about 6 hours ago
This is the fault of NVIDIA, who, instead of using the terms that had been used for decades in computer science before them for things like vector lanes, processor threads, processor cores etc., have invented a new jargon by replacing each old word with a new word, in order to obfuscate how their GPUs really work.

Unfortunately, ATI/AMD has slavishly imitated many things initiated by NVIDIA, so soon afterwards they created their own jargon, replacing every word used by NVIDIA with a different word, also different from the traditional word, enhancing the confusion. The worst part is that the NVIDIA jargon and the AMD jargon sometimes reuse traditional terms while giving them different meanings; e.g., an NVIDIA thread is not what a "thread" normally means.

Later standards, like OpenCL, have attempted to make a compromise between the GPU vendor jargons, instead of going back to a more traditional terminology, so they have only increased the number of possible confusions.

So to be able to understand GPUs, you must create a dictionary with word equivalences: traditional => NVIDIA => ATI/AMD (e.g. IBM 1964 task = Vyssotsky 1966 thread => NVIDIA warp => AMD wavefront).

MindSpunk · about 7 hours ago
All the names for waves come from different hardware and software vendors adopting names for the same or similar concept.

- Wavefront: AMD, comes from their hardware naming

- Warp: Nvidia, comes from their hardware naming for largely the same concept

Both of these were implementation details until Microsoft and Khronos enshrined them in the shader programming model independent of the hardware implementation, so you get

- Subgroup: Khronos' name for the abstract model that maps to the hardware

- Wave: Microsoft's name for the same

They all describe mostly the same thing, so they all get used and you get the naming mess. It doesn't help that the API specs use wave/subgroup, but the vendor profilers will use warp/wavefront in the names of their hardware counters.

amelius · about 4 hours ago
> The ratio of CPUs to CPU programmers and GPUs to GPU programmers is massively out of whack.

These days I just ask an LLM to write my optimized GPU routines.

zozbot234 · about 5 hours ago
It looks like they're trying to map the entire "normal GPU programming model" to Rust code, including potentially things like GPU "threads" (to SIMD lanes + masked/predicated execution to account for divergence) and the execution model where a single GPU shader is launched in multiple instances with varying x, y and z indexes. In this context, it makes sense to map the GPU "warp" to a Rust thread since GPU lanes, even with partially independent program counters, still execute in lockstep much like CPU SIMD/SPMD or vector code.
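The masked/predicated execution described above can be sketched in plain CPU-side Rust. This is a toy model of the lowering, not VectorWare's actual implementation: every lane evaluates both arms of a source-level branch, and a per-lane mask selects the result, so the "divergent" branch never actually branches.

```rust
// Toy model of masked/predicated execution across SIMD lanes.
// A source-level branch `if x % 2 == 0 { x * 10 } else { -x }` is lowered
// so that all lanes compute both arms; a mask picks the live result.

fn predicated(lanes: &[i32]) -> Vec<i32> {
    // Per-lane condition mask for the branch.
    let mask: Vec<bool> = lanes.iter().map(|&x| x % 2 == 0).collect();
    // Both arms are computed for every lane, divergence-free.
    let then_arm: Vec<i32> = lanes.iter().map(|&x| x * 10).collect();
    let else_arm: Vec<i32> = lanes.iter().map(|&x| -x).collect();

    // Select per lane according to the mask (like a hardware select).
    mask.iter()
        .zip(then_arm.iter().zip(&else_arm))
        .map(|(&m, (&t, &e))| if m { t } else { e })
        .collect()
}

fn main() {
    let warp: Vec<i32> = (0..4).collect();
    // Lanes 0..3: 0 is even -> 0, 1 is odd -> -1, 2 -> 20, 3 -> -3.
    assert_eq!(predicated(&warp), vec![0, -1, 20, -3]);
    println!("{:?}", predicated(&warp));
}
```

The trade-off is the one discussed elsewhere in this thread: both arms always execute, so a mixed mask pays for both paths.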
rl3 · about 9 hours ago
I think they've taken the integration difficulty into account.

Besides, full redesign isn't so expensive these days (depending).

>It seems like a solution in search of a problem.

Agreed, but it'll be interesting to see how it plays out.

kevmo314 · about 7 hours ago
Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow; in fact, they're quite a bit faster than any single GPU thread. If code is written in a GPU-unaware way, it's not going to take advantage of the reasons for being on the GPU in the first place.
lmeyerov · about 6 hours ago
We have this issue in GFQL right now. We wrote the first OSS GPU Cypher query language implementation, where we make a query plan of GPU-friendly collective operations... but today the steps are coordinated via Python, which has high constant overheads.

We are looking to shed some of the Python <-> C++ <-> GPU overheads by pushing macro steps out of Python and into C++. However, it'd probably be way better to skip all the CPU <-> GPU back-and-forth by coordinating the task queue on the GPU to begin with. It's 2026, so ideally we can use modern tools and type safety for this.

Note: I looked at the company's GitHub and didn't see any relevant OSS, which changes the calculus for a team like ours. Sustainable infra is hard!

imtringued · about 3 hours ago
I've seen this objection pop up every single time and I still don't get it.

GPUs run 32, 64, or even 128 vector lanes at once. If you have a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence, etc., how is it supposed to be slower?

Consider the following:

You have a hyperoptimized matrix multiplication kernel and you also have your inference engine code that previously ran on the CPU. You now port the critical inference engine code to directly run on the GPU, thereby implementing paged attention, prefix caching, avoiding data transfers, context switches, etc. You still call into your optimized GPU kernels.

Where is the magical slowdown supposed to come from? The mega kernel researchers are moving more and more code to the GPU and they got more performance out of it.

Is it really that hard to understand that the CUDA style programming model is inherently inflexible and limiting? I think the fundamental problem here is that Nvidia marketing gave an incredibly misleading perception of how the hardware actually works. GPUs don't have thousands of cores like CUDA Core marketing suggests. They have a hundred "barrel CPU"-like cores.

The RTX 5090 is advertised to have 21760 CUDA cores. This is a meaningless number in practice, since the "CUDA cores" are purely a software concept that doesn't exist in hardware. The vector processing units are not cores. The RTX 5090 actually has 170 streaming multiprocessors, each with their own instruction pointer that you can target independently just like a CPU. The key restriction here is that if you want maximum performance you need to take advantage of all 128 lanes, and you also need enough thread copies that only differ in the subset of data they process, so that the GPU can switch between them while it is working on multi-cycle instructions (memory loads and the like). That's it.

Here is what you can do: you can take a bunch of streaming multiprocessors, let's say 8, and use them to run your management code on the GPU side without having to transfer data back to the CPU. When you want to do heavy lifting you are in luck, because you still have 162 streaming multiprocessors left to do whatever you want. You proceed to call into cuDNN and get great performance.
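The arithmetic above can be sanity-checked in a couple of lines of Rust. This is just the identity the comment describes (SM count times FP32 lanes per SM equals the advertised "CUDA core" count), using the RTX 5090 figures quoted in the comment:

```rust
// The advertised "CUDA core" count is just the product of the number of
// streaming multiprocessors (SMs) and the FP32 lanes per SM.
fn advertised_cores(sms: u32, lanes_per_sm: u32) -> u32 {
    sms * lanes_per_sm
}

fn main() {
    // RTX 5090 figures as quoted above: 170 SMs x 128 lanes = 21760 "cores".
    assert_eq!(advertised_cores(170, 128), 21760);
    println!("170 SMs x 128 lanes = {}", advertised_cores(170, 128));
}
```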

Bimos · about 3 hours ago
> a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

But the library is using a warp as a single thread.

monideas · about 3 hours ago
I really appreciate the way you've explained this. Are there any resources you recommend to reach your level of understanding?
gpm · about 7 hours ago
Is this proprietary, or something I can play around with? I can't find a repo.
LegNeato · about 7 hours ago
It is not proprietary; we just haven't upstreamed everything yet.
20k · about 6 hours ago
This programming model seems like the wrong one, and I think it's based on some faulty assumptions

>Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur

This is extremely problematic - being able to write divergent code between lanes is good. Virtually all high performance GPGPU code I've ever written contains divergent code paths!

>The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution

This is where I think it falls apart a bit, and we need to dig into GPU architecture to find out why. A lot of people think that GPUs are a bunch of executing threads that are grouped into warps which execute in lockstep. This is a very overly restrictive model of how they work that misses a lot of the reality

GPUs are a collection of threads that are broken up into local work groups. These share L2 cache, which can be used for fast intra-work-group communication. Work groups are split up into subgroups - which map to warps - that can communicate extra fast

This is the first problem with this model: it neglects the local work group execution unit. To get adequate performance, you have to set this value much higher than the size of a warp, at least 64 for a 32-wide warp. In general though, 128-256 is a better size. Different warps in a local work group make true independent progress, so if you take this into account in Rust, it's a bad time and you'll run into races. To get good performance and cache management, these warps need to be executing the same code. Trying to have a task per warp is a really bad move for performance

>Each warp has its own program counter, its own register file, and can execute independently from other warps

The second problem is: it used to be true that all threads in a warp would execute in lockstep, and strictly have on/off masks for thread divergence, but this is strictly no longer true for modern GPUs, the above is just wrong. On a modern GPU, each *thread* has its own program counter and callstack, and can independently make forward progress. Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

Say we have two warps, both running the same code, where half of each warp splits at a divergence point. Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level. But notice that to get this hardware acceleration, we need to actually use the GPU programming model to its fullest

The key mistake is to assume that the current warp model is always going to stick rigidly to being strictly wide SIMD units with a funny programming model, but we already ditched that concept a while back on GPUs, around the Pascal era. As time goes on this model will only diverge further from how GPUs actually work under the hood, which seems like an error. Right now, even with just the local work group problems, I'd guess you're leaving ~50% of your performance on the table, which seems like a bit of a problem when the entire reason to use a GPU is performance!
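For reference, the serialization cost that the classic lockstep/masking model assigns to a divergent branch can be illustrated with a toy cost model in Rust. This is my own sketch, not a model of any particular GPU: when all active lanes agree, the warp pays for one side of the branch; when they split, it executes both sides back to back under masks.

```rust
// Toy SIMT cost model: a warp executes a branch arm only if at least one
// lane takes it; mixed lane conditions therefore pay for both arms.
fn branch_cycles(lane_conds: &[bool], then_cost: u32, else_cost: u32) -> u32 {
    let any_then = lane_conds.iter().any(|&c| c);
    let any_else = lane_conds.iter().any(|&c| !c);
    let mut cycles = 0;
    if any_then { cycles += then_cost; }
    if any_else { cycles += else_cost; }
    cycles
}

fn main() {
    // Uniform warp: every lane takes the same path, so only one arm runs.
    assert_eq!(branch_cycles(&[true; 32], 100, 40), 100);

    // Divergent warp: even one lane on the other path serializes both arms.
    let mut mixed = [true; 32];
    mixed[0] = false;
    assert_eq!(branch_cycles(&mixed, 100, 40), 140);
}
```

Whether modern hardware can reorder its way out of this (as claimed above) or still fundamentally pays it (as argued in the replies below) is exactly what the rest of the thread debates.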

david-gpu · about 4 hours ago
> Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level

Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

> Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.
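The coalescing point above can be made concrete with a small Rust sketch that counts how many memory segments one warp's loads touch. The 32-lane warp, 4-byte elements, and 128-byte segment size are assumptions matching NVIDIA's commonly documented figures, not anything from this thread:

```rust
// Count the distinct 128-byte memory segments touched when each of a
// warp's 32 lanes loads one 4-byte element at a given stride. Fewer
// segments means fewer memory transactions per warp.
const WARP: usize = 32;
const SEGMENT_BYTES: usize = 128;
const ELEM_BYTES: usize = 4;

fn segments_touched(stride_elems: usize) -> usize {
    let mut segs: Vec<usize> = (0..WARP)
        .map(|lane| lane * stride_elems * ELEM_BYTES / SEGMENT_BYTES)
        .collect();
    segs.sort_unstable();
    segs.dedup();
    segs.len()
}

fn main() {
    // Neighboring lanes read neighboring words: one segment per warp.
    assert_eq!(segments_touched(1), 1);
    // A stride of 32 elements puts every lane in its own segment.
    assert_eq!(segments_touched(32), 32);
}
```

Same instruction, same lane count, up to a 32x difference in memory transactions, which is the locality argument in a nutshell.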

imtringued · about 3 hours ago
>The second problem is: it used to be true that all threads in a warp would execute in lockstep, and strictly have on/off masks for thread divergence, but this is strictly no longer true for modern GPUs, the above is just wrong. On a modern GPU, each thread has its own program counter and callstack, and can independently make forward progress. Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

I haven't found any evidence of the individual program counter thing being true beyond one niche application: running mutexes for a single vector lane, which is not a performance optimization at all. In fact, you are serializing execution in the worst way possible.

From a hardware design perspective it is completely impractical to implement independent instruction pointers other than maybe as a performance counter. Each instruction pointer requires its own read port on the instruction memory and adding 32, 64 or 128 read ports to SRAM is prohibitively expensive, but even if you had those ports, divergence would still lead to some lanes finishing earlier than others.

What you're probably referring to is a scheduler trick that Nvidia has implemented where they split a streaming processor thread with divergence into two masked streaming processor threads without divergence. This doesn't fundamentally change anything about divergence being bad, you will still get worse performance than if you had figured out a way to avoid divergence. The read port limitations still apply.