Discussion (44 Comments), from HackerNews
This seems like a strange thing to say. Fine-grained feature detection was around long before "microarchitecture levels" and never went away. The microarchitecture levels were introduced because they were easier to use.
[0] https://github.com/ronnychevalier/cargo-multivers
[0] https://doc.rust-lang.org/stable/std/macro.is_x86_feature_de...
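For anyone who hasn't used it, here is a minimal sketch of fine-grained detection with the standard `is_x86_feature_detected!` macro, rolled up into a rough microarchitecture level. `approx_level` is a hypothetical helper, and the feature lists are a representative subset of what v2 and v3 require, not the complete definitions.

```rust
/// Rough estimate of the x86-64 level (1 to 3) this CPU satisfies.
/// Assumption: only a subset of each level's required features is checked.
#[cfg(target_arch = "x86_64")]
fn approx_level() -> u8 {
    let v2 = is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
        && is_x86_feature_detected!("sse4.2")
        && is_x86_feature_detected!("popcnt");
    let v3 = v2
        && is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("lzcnt");
    if v3 {
        3
    } else if v2 {
        2
    } else {
        1
    }
}

/// On other architectures the macro does not exist, so report 0.
#[cfg(not(target_arch = "x86_64"))]
fn approx_level() -> u8 {
    0
}

fn main() {
    println!("approximate x86-64 level: v{}", approx_level());
}
```

The individual checks are what a level is made of, which is why the levels are mostly a convenience on top of per-feature detection.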
But yeah, no: on the whole, the cost of the checks and the duplicated binary size aren't seen as worth it, so instead you get piecemeal implementations, mostly in numeric packages like Eigen and LAPACK.
Because that’s where the user-noticeable gains can be made. Using popcount in code you run once is going to shave off, maybe, 100 cycles. That isn’t worth the extra cycles of that approach.
Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.
Finally, where it really matters, it’s not only a matter of recompiling the same code. For optimal performance, you also want to change loop unrolling strategy, stride count, etc.
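A back-of-the-envelope sketch of the combinatorial explosion mentioned above, assuming the worst case where every optional feature can be present or absent independently (real CPUs cover far fewer combinations). `variant_count` is a hypothetical helper and the feature counts are illustrative only.

```rust
/// Worst-case number of builds if each of `features` optional extensions
/// is toggled independently: 2^features.
fn variant_count(features: u32) -> u64 {
    1u64 << features
}

fn main() {
    // Illustrative feature counts, not an official list.
    for n in [4u32, 8, 12] {
        println!("{n} independent features -> {} binaries", variant_count(n));
    }
}
```

Compare that with a handful of levels, where the number of builds grows linearly with the number of levels.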
[0] https://www.phoronix.com/review/clear-linux-48p-ubuntu/6
For many other things, like using a YMM register to copy a 32-byte struct or a variable shift, run-time dispatch just does not make sense. You will only see a benefit if you generate this code unconditionally. For FMA, you wouldn't even get bit-identical output, leading to testing concerns.
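To illustrate the FMA point, here is a small sketch using Rust's `f64::mul_add`, which rounds once, against a separate multiply and add, which round twice. The function names and the input values are chosen for illustration only.

```rust
/// Multiply, round, then add and round again.
fn unfused(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Fused multiply-add: a single rounding at the end.
fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

fn main() {
    // eps = 2^-27, so (1 + eps) * (1 - eps) = 1 - 2^-54 exactly,
    // which is not representable as an f64.
    let eps = 1.0 / (1u64 << 27) as f64;
    let (a, b, c) = (1.0 + eps, 1.0 - eps, -1.0);

    println!("unfused: {:e}", unfused(a, b, c));
    println!("fused:   {:e}", fused(a, b, c));
}
```

With these inputs the unfused product rounds to exactly 1.0, so the two results differ. A test suite that compares against golden outputs would see a mismatch between the FMA and non-FMA builds.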
the thread is about runtime detection tbf
The same conclusion: v2 as baseline, v3 where possible.
I'm really surprised it's not standard in every toolchain to support arch levels like this today.
Some compilers like Clang allow multiple arch versions in one binary, runtime dispatched. I would love to implement this in our toolchain too.
[0] Please forgive the SEO-style title, it's, well, to get search engines to recognise what's in the article: https://blogs.remobjects.com/2026/01/26/fast-math-in-six-lan...
It's not entirely free; the cost is that the resulting binary will no longer run on processors that lack the instruction. Which, admittedly, is ≈2007 or older. But still! I have a 2012 CPU still in service, and as much as I'd love to obsolete it, *gestures at the price tag of RAM these days*.
… a 2012 CPU is surprisingly competitive relative to today's tech, too, I'd add. The gap between 2012 and 2026 is nothing compared to the equivalent gap between 1998 and 2012: 1998 is like 500MHz single-core, 32-bit. 2012 is 4 core, 8 hyper threads, 64-bit, 3.5 GHz. (… perhaps more remarkably, my next-oldest machine, a 2017 laptop, is only 2.8 GHz, with the same 4(/8) cores. It also uses like half the power, too. That's mostly the "laptop" bit, though.)
(That same CPU is also incapable of "v3".)
I suspect that heavily optimised code either uses intrinsics or carefully written assembler code.
Ubuntu started allowing defaulting to v3 packages, and I opted in. I already use the -C native to enable AVX512 when compiling binaries for local use. This matters a lot for compute/analytics workloads in my experience.
Speaking of Dr Lemire's suggestion of a V5 architecture level, would that make any sense given the fragmentation of AVX512? It's absent from Intel consumer devices, but present on the last few generations of AMD.
With the exception of Cooper Lake (Q2 2020; a server CPU with a modest installed base), all the CPUs introduced after Ice Lake (Q3 2019) that support any kind of AVX-512 support all the AVX-512 subsets of Ice Lake (which has very important additions over V4).
This includes all AMD Zen 4, Zen 5 and Zen 6 CPUs, which form the bulk of the non-server CPUs that support AVX-512. Thus 6 years have passed since the introduction of an AVX-512 CPU that is not compatible with Ice Lake (and 7 years since any such CPU that was in widespread use).
Both Intel and AMD have stated that from now on features will be added to AVX-512 (a.k.a. AVX10), not deleted, so in the future testing the AVX10 version number will be sufficient for determining CPU capability in this domain.
It would make sense to define a V5 level that includes all instructions of Ice Lake and also a V6 level, corresponding to AVX10.1 (Intel Granite Rapids) or to AVX10.2 (Intel Diamond Rapids).
I wonder if this is a natural law, or emergent behavior of complex systems?
https://go.dev/wiki/MinimumRequirements#:~:text=The%20Go%20t...
Most of the recent additions in processor instruction sets are intended for relatively niche applications.
In such cases, other applications will not be affected at all, but the specific application that is the target, for example a certain cryptographic algorithm or AI inference, may be accelerated many times when using the new ISA version instead of the old ISA version.
Moreover, compilers are frequently not smart enough to take advantage of such ISA extensions, so it is not enough to change the compilation flags; you need to rewrite some library to get the full performance benefit. For example, many recent x86_64 CPUs have IFMA instructions (integer fused multiply-add instructions), which allow the use of the floating-point multipliers for doing arithmetic operations with big integer numbers (the advantage is that modern CPUs have many more FP multipliers than integer multipliers). This can greatly accelerate computations with big numbers, but you need a complete, carefully written library that uses such instructions; you cannot just recompile some programs to make them run faster.
From time to time it may still happen that some ISA extension has a wider applicability, being able to accelerate many applications, possibly just by recompilation, as Intel hopes will happen with the APX extension that will arrive early next year, in the Intel Nova Lake and Diamond Rapids CPUs.
Most non-professional computer users are biased toward single-threaded application performance, where diminishing returns have already been seen for more than 2 decades.
On the other hand, for multi-threaded application throughput, we have not yet reached any diminishing returns. The throughput per CPU socket has continued to increase in geometric progression every year until now. The only serious problem is that starting around 10 years ago, from the days of Intel Kaby Lake and Coffee Lake, the price of computers has started to increase, and the rate of increase has accelerated recently.
So now the possible throughput for a given computer size becomes less and less relevant in comparison with the throughput per dollar, and for the throughput per dollar it appears that we have already entered the region of diminishing returns (i.e. with unlimited budget you can still buy computers whose throughput increases in geometric progression each year, but the computers that you can actually still afford have a throughput that increases much more slowly).
If you are a software vendor, or even just a contributor to some open-source program, you must make some compromise between program performance and its ability to run without modifications on as large a number of computers as possible.
Therefore you must either avoid any features available only in newer computers, or you must have some kind of processor capability detection at run time, followed by the selection of appropriate program variants.
You might not be able to afford to prepare enough program variants, so it is likely that you would still choose not to support the most recent computers.
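A minimal sketch of the "detect at run time, then select a variant" approach described above, in Rust. All function names are hypothetical, and the AVX2 variant simply recompiles the same loop with the feature enabled; whether the optimiser actually emits AVX2 instructions for it is not guaranteed.

```rust
/// Portable baseline; also the shared body for the specialised variant.
#[inline(always)]
fn sum_portable(xs: &[u32]) -> u32 {
    xs.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

/// Same source, compiled with AVX2 enabled so the optimiser may use it.
/// Unsafe because calling it on a CPU without AVX2 is undefined behaviour.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    sum_portable(xs)
}

/// Picks a variant based on what the running CPU reports.
pub fn sum(xs: &[u32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at run time.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_portable(xs)
}

fn main() {
    let data: Vec<u32> = (1..=100).collect();
    println!("sum = {}", sum(&data));
}
```

Every extra variant is one more function to compile, ship, and test, which is the cost the comment above is pointing at.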
Edit: to address your literal remark: so even the title is correct, if you think of a programming language as more than its syntax.
Go's selling point is definitely not performance.