Discussion (47 Comments) | Read Original on HackerNews
I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?
For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.
That aside, idle power consumption is a driver-to-driver affair from both AMD and novideo. Sometimes I'm only pulling 15-30 W when nothing is happening, and other times it decides it needs 110 W for a static 500 Hz screen.
[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...
GPUs do branch prediction? I thought they didn't bother and instead tried to minimize wasted effort by using high numbers of concurrent threads?
Hardware is different. Every operation that can be performed in hardware by a chip needs dedicated circuitry. Special-casing 0 and 1 means adding at least an OR reduction on each operand and a dedicated multiplexer for every bit of the output. Those transistors use power even when they're not in use (leakage power is a huge issue on modern semiconductor processes). They also degrade timing by adding more gates on critical paths through the multipliers. (The timing issue here is that all operations that happen between one flip-flop and another flip-flop need to finish within one clock cycle.) And unless there are whole blocks of 0's and 1's (this does happen in certain neural networks), you typically won't see a direct speedup anyway. In software terms, the matrix multiply is scheduled as many parallel operations that cannot be accelerated much overall by skipping a few operations in some "threads."
All of this makes zero skipping a nontrivial topic. People do still try to do it but it needs serious consideration as, depending on the application, the case is rarely one-sided.
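To make the "whole blocks" point concrete, here is a toy sketch in Python. It assumes a simplified lockstep model that is not from the article: operands are issued in groups of `lanes`, and a group can be skipped only when every operand in it is zero. The function names and the 32-lane width are made up for illustration; real hardware scheduling is far more involved.

```python
import random

def lockstep_cycles(operands, lanes, skip_zeros):
    """Count issue cycles for a group of lanes that run in lockstep.

    Assumed toy model: a group of `lanes` operands costs one cycle unless
    zero skipping is on AND every operand in the group is zero.
    """
    cycles = 0
    for start in range(0, len(operands), lanes):
        group = operands[start:start + lanes]
        if skip_zeros and all(x == 0 for x in group):
            continue  # the entire group is zero, so nothing needs issuing
        cycles += 1
    return cycles

def scattered_zeros(n, zero_fraction, seed=0):
    """Zeros sprinkled at random positions (hypothetical test data)."""
    rng = random.Random(seed)
    return [0 if rng.random() < zero_fraction else 1 + rng.random()
            for _ in range(n)]

def blocked_zeros(n, zero_fraction):
    """The same share of zeros, but packed into one contiguous block."""
    n_zero = int(n * zero_fraction)
    return [0] * n_zero + [1.5] * (n - n_zero)

if __name__ == "__main__":
    n, lanes = 1024, 32
    scattered = scattered_zeros(n, 0.5)
    blocked = blocked_zeros(n, 0.5)
    print("no skipping:       ", lockstep_cycles(blocked, lanes, False))  # 32
    print("skip, blocked:     ", lockstep_cycles(blocked, lanes, True))   # 16
    print("skip, scattered:   ", lockstep_cycles(scattered, lanes, True))
```

In this model, half the operands being zero only halves the cycle count when the zeros line up in whole groups; scattered zeros leave almost every group with at least one nonzero lane, so the group still has to run.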
How much die space ($) will that circuitry add, given that it has a statistically near-zero chance of mattering for your main customers' workloads (who has model weights of exactly 0 or 1!?)? And, if you can stomach the cost, what else could you put there instead?
I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.
There was a workshop paper from SC24 that did more experiments around this, I believe. I can't find it now though.
https://arxiv.org/abs/2404.00456
When you make it so the computer does not have to compute all possible states of matter, it finishes faster.
Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
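The arithmetic behind that example, as a quick sketch in Python. The 90% and 70% figures are the illustrative ones from the comment above, not measurements, and the function names are made up here. It also assumes a fixed-size job whose runtime scales inversely with performance.

```python
def perf_per_watt_gain(perf_fraction, power_fraction):
    """Relative performance-per-watt versus the unlimited card."""
    return perf_fraction / power_fraction

def energy_per_job(perf_fraction, power_fraction):
    """Relative energy for a fixed job: power multiplied by runtime.

    Assumes runtime scales as 1 / perf_fraction.
    """
    return power_fraction * (1.0 / perf_fraction)

if __name__ == "__main__":
    print(f"{perf_per_watt_gain(0.90, 0.70):.3f}")  # 1.286
    print(f"{energy_per_job(0.90, 0.70):.3f}")      # 0.778
```

So under those numbers the job takes about 11% longer but uses about 22% less energy: more efficient, not faster.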
This is precisely because of the efficiency. The lower efficiency at the higher speed triggers throttling, and with it much lower performance, sooner.
This is not true unless the throttling algorithm is so broken that it's oscillating between extremes.
The parts have a curve of clock speed versus voltage. More clock speed means higher performance. That goes further up the voltage curve, meaning more power.
Throttling just moves the card further down the voltage to clock speed curve. It reduces clock speed, reducing performance.
The cards don't "perform faster by running slower". If you run the card slower, it performs slower.
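A rough sketch of that curve in Python, using the standard CMOS approximation that dynamic power scales with C·V²·f. Everything numeric here is an assumption for illustration: the linear voltage/clock curve, its constants, and the idea that throughput tracks clock speed (i.e. a compute-bound workload). Leakage power is ignored.

```python
def voltage_for_clock(clock_ghz, v_min=0.70, slope_v_per_ghz=0.20, base_ghz=1.0):
    """Hypothetical linear voltage/frequency curve (made-up constants)."""
    return v_min + slope_v_per_ghz * max(0.0, clock_ghz - base_ghz)

def relative_dynamic_power(clock_ghz, capacitance=1.0):
    """Dynamic power approximation P ~ C * V^2 * f, in arbitrary units."""
    v = voltage_for_clock(clock_ghz)
    return capacitance * v * v * clock_ghz

if __name__ == "__main__":
    for clock in (1.0, 1.5, 2.0):
        power = relative_dynamic_power(clock)
        # Performance is assumed proportional to clock (compute-bound case).
        print(f"clock={clock:.1f} GHz  power={power:.3f}  perf/power={clock / power:.3f}")
```

In this model, lowering the clock always lowers performance; what improves is performance per unit of power, because voltage drops along with the clock.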
~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph.
I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.
And for actual runs, pick it from a pre-run sampled curve.
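A minimal sketch of that selection step in Python, assuming you already have a sampled curve of (clock, throughput) pairs per data distribution. The sample values below are placeholders invented for the example, not numbers from the article, and `best_clock` is a hypothetical helper name.

```python
def best_clock(samples):
    """Return the clock (MHz) with the highest measured throughput.

    `samples` is a list of (clock_mhz, tflops) pairs from a pre-run sweep.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return max(samples, key=lambda pair: pair[1])[0]

if __name__ == "__main__":
    # Placeholder sweep data, purely illustrative.
    sampled_curves = {
        "uniform": [(1200, 240.0), (1400, 262.0), (1600, 255.0)],
        "normal":  [(1200, 238.0), (1400, 250.0), (1600, 241.0)],
    }
    for name, samples in sampled_curves.items():
        print(name, best_clock(samples))
```

For a real run you would replace the placeholder table with a short sweep over clock limits using a sample of the actual input data.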
https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...
https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...
https://arxiv.org/html/2604.03279