RISC-V and Floating-Point

hhasheddan 2 days ago 48 commentsRead Article on fprox.substack.com

ES version is available. Content is displayed in original English for accuracy.

⚡ Community Insights

Discussion Sentiment

47% Positive

Analyzed from 1866 words in the discussion.

Discussion (48 Comments)Read Original on HackerNews

taraharris•about 1 hour ago

I think it's a good thing to not bake IEEE 754 too deeply into RISC-V.

I really want to see hardware built around posits. This is not because they're necessarily superior to floats (they aren't in all use cases), but just because we need some diversity. Too much standardization is bad for innovation, and not everything that's settled should remain that way.

https://www.sigarch.org/posit-a-potential-replacement-for-ie...

Netch•2 days ago

From a bystanderʼs POV it is excessively hard to memorize all the mess with multiple different extensions. The naming style doesnʼt alleviate the task. But this is a typical issue in the whole RISC-V ecosystem.

What Iʼm slightly confused for is that all these extensions, useful for a minor part of applications, arenʼt moved to longer instructions (6-byte).

rwmj•about 12 hours ago

I agree the naming is annoying. However the extensions aren't too hard to understand. They are much more regular than on other architectures. I wrote a deep dive into them here: https://research.redhat.com/blog/article/risc-v-extensions-w...

Also groups of extensions are consolidated into Profiles, so in practice you don't really care about individual extensions. You'll only care that the hardware supports eg RVA23.

brucehoult•2 days ago

The average bystander doesn't have to care, just buy a machine implementing the RVA23 profile (standard set of extensions) and be happy.

If you're building your own embedded hardware then you determine what your needs actually are e.g. do you need double precision? half precision? vector?. Then you choose a chip implementing that. Then you copy the ISA string from your chip's documentation to the `-march=` argument for GCC/Clang and be happy.

It's not hard and you don't have to think about it unless you very specifically want to.

fweimer•about 11 hours ago

How does the average bystander buy an RVA23 machine today?

self•about 10 hours ago

You buy a https://www.spacemit.com/products/keystone/k3 board. A few links:

- https://sipeed.com/k3

- https://milkv.io/jupiter2-dev-kit

More here: https://redd.it/1tcrx4o

bjourne•about 12 hours ago

The average bystander might want to write high-performance code for their risc-v cpu. Then they must know precisely which instructions are available and what the performance implications of using them are. E.g., the difference between a shared and non-shared fp register file is huge.

brucehoult•about 8 hours ago

If you want to get the absolute most out of a specific CPU that is in your hands then you of course have to refer to the documentation for that specific CPU.

That process doesn't depend on whether it's an x86 or an Arm or a RISC-V.

That's why x86 people refer to the HUGE document maintained by Agner Fog.

If you want your code to run well on all standards-compliant implementations then you write according to the ISA documentation, in this case RVA23. Or ARMv9-A. Or x86_64 v3.

rwmj•about 12 hours ago

For the "average bystander" they're going to buy an OS and compatible hardware, or if they're the average programmer they're going to use a compiler and libraries that solve the problem already for them. Very very few people need to worry about the details.

LoganDark•about 8 hours ago

Do high-performance RISC-V CPUs (that you can actually buy) still exist? SiFive Unleashed was great but IIRC it was a single batch that never returned.

panick21_•about 12 hours ago

You think the average person writes performance optmized code?

If you are on that level then you know pretty well what you are targeting. And even then in 99% of cases you just look at the top level profile.

If you do performance analysis for some specific embeded project that is not using a standard profile, then its a bit more work, but hardly impossible.

camel-cdr•2 days ago

> From a bystanderʼs POV it is excessively hard to memorize all the mess with multiple different extensions

It's the same for other ISAs.

> What Iʼm slightly confused for is that all these extensions, useful for a minor part of applications, arenʼt moved to longer instructions (6-byte).

Because these instructions don't need it. There will be future >4-byte instructions, for things thay can't resonably be done in 4-bytes, e.g. much larger immediates.

pclmulqdq•about 11 hours ago

It's way worse on RISC-V. There are maybe 5 x86 or ARM variants to care about at any given time, even if you want to hyper-optimize your code. RISC-V has a soup of literally 100s of extensions with non-uniform use and support.

camel-cdr•about 9 hours ago

There are a lot more ARM extensions than people are aware of. E.g. debian uses ARMv8-A with FEAT_FP and FEAT_AdvSIMDas a base. Yes, floating-point and SIMD are optional in ARMv8-A, as are the following ISA extensions, only including ones that add instructions and excluding the AArch32 stuff: FEAT_CRC32, FEAT_AES, FEAT_PMULL, FEAT_SHA1, FEAT_SHA256, FEAT_RDM, FEAT_F32MM, FEAT_F64MM, FEAT_I8MM, FEAT_LSMAOC, FEAT_SHA3, FEAT_SHA512, , FEAT_SM3, FEAT_SM4, FEAT_SVE, FEAT_EPAC, FEAT_FCMA, FEAT_JSCVT, FEAT_LRCPC, FEAT_DotProd, FEAT_FHM, FEAT_FlagM, FEAT_LRCPC2, FEAT_BTI, FEAT_FRINTTS, FEAT_FlagM2, FEAT_MTE, FEAT_MTE2, FEAT_RNG, FEAT_SB, FEAT_BF16, FEAT_DGH, FEAT_EBF16, FEAT_CSSC, ...

Also fun: FEAT_LittleEnd, FEAT_MixedEnd, FEAT_BigEnd

All of that was just 64-bit ARMv8.x-a, there is a lot more stuff, once you go to R or M profiles, 32-bit and previous versions.

The reason this is mostly not a problem, is that distros converged on a minimum of 64-bit ARMv8-A + FP + SIMD, which will also happen with RVA23 for RISC-V.

Just for fun, here are the Zen4 ISA flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3 dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm

Compared to RVA23 written out: rv64imafdcbv_zicsr_zicntr_zihpm_ziccif_ziccrse_ziccamoa_zicclsm_zic64b_za64rs_zihintpause_zba_zbb_zbs_zicbom_zicbop_zicboz_zfhmin_zkt_zvfhmin_zvbb_zvkt_zihintntl_zicond_zimop_zcmop_zcb_zfa_zawrs_svbare_svade_ssccptr_sstvecd_sstvala_sscounterenw_svpbmt_svinval_svnapot_sstc_sscofpmf_ssnpm_ssu64xl_sha_supm_zifencei

addaon•about 5 hours ago

> There are maybe 5 x86 or ARM variants to care about at any given time

What? There are individual chips with nearly that many ARM variants, including incompatible ISAs (M0 vs R52) and compatible-but-very-different-performance-characteristics implementations of the same ISA (M4 vs M7, say). Even figuring out what portion of code can be shared across which cores (and for those that distinguish between ARM and Thumb mode, what mode that code can be called in), vs what code needs duplicate versions for different cores for correctness, vs what code needs duplicate versions for performance but not correctness (which changes as the code usage pattern evolves) can be a challenge on a single chip; I can't imagine a world where you can think about only five across an entire industry.

benj111•about 10 hours ago

What are you imagining? If this is desktop then most of the extensions are going to be standard.

The only reason they're optional is because I'm using the same instruction set on my Pico, so no it doesn't have floating point, and I believe it has integer divide but I wouldn't be surprised if it didn't.

And the extensions are in groups, a good chunk of which are compressed instructions, which unless you're writing assembly, you don't need to worry about.

In fact most of this you don't need to worry about unless youre writing assembly.

wg0•about 10 hours ago

> It's the same for other ISAs.

No they are not. See the Intel Software Programmer Volumes. Highly detailed, highly structured and highly specific.

Joker_vD•about 10 hours ago

You're joking, right?

fidotron•about 9 hours ago

From the point of view of the RISC-V architects the "users" are the chip designers who are engaged in a sort of build-your-own-instruction-set situation, and this kind of makes sense, but does contribute to it being a mess.

They are absolutely in denial as to the downstream effects of this on the software ecosystem. Android, for example, for native support had enough fun dealing with relatively few ARM variants (and x86/MIPS etc), and identifying chip features at run time was reliant on the board support software getting it right (hint: it didn't).

pjmlp•about 9 hours ago

It is like Khronos APIs but in hardware, design by comittee at its best.

NooneAtAll3•about 9 hours ago

looks like there's no mention of soft-floats support, like in Hazard3 cores

see f.e.: https://wren.wtf/shower-thoughts/marks-magic-multiply/

Narishma•about 2 hours ago

Because that's not a RISC-V standard but a custom extension implemented by one chip.

mavdol04•about 7 hours ago

Working on a RISC-V emulator targeting Wasm. Is RVV 1.0 stable enough to be worth implementing, or would Zve32f/Zve64d already cover most use cases ?

andrepd•about 11 hours ago

Some of the complexity that comes with this really comes from the complexity of IEEE734 itself, plus the fragmentation of alternatives at lower precision.

I would have loved if the article mentioned the efforts at integrating Posits [0] in risc-v. While IEEE734 compatibility will obviously be necessary for any foreseeable future, it would be nice if the industry could settle on a better alternative which avoids many of the flaws with IEEE floats.

[0] https://github.com/andrepd/posit-rust

jcranmer•about 6 hours ago

The takes I've seen from pretty much all the numerical analysts is that posits are not really viable as a replacement for floating-point; they're just a much worse implementation. There are a lot of flaws with IEEE 754 (I'm upset that NaN != NaN, e.g., and sNaNs seem to be universally considered a design mistake), but the underlying theory behind it is very well worked-out, and even its more controversial features (like denormals) turn out to have been vital innovations.

One of the issues with posits from a numerical perspective is that it's main claim to fame of being more accurate is handled by having the number of precision bits being dependent on the exponent value, which means it's a return to the days of having to scale your input so that the values are around unity to maximize, rather than the promise of floating-point of being scale-independent.

andrepd•about 4 hours ago

> The takes I've seen from pretty much all the numerical analysts is that posits are not really viable as a replacement for floating-point; they're just a much worse implementation

Care to elaborate? All the analyses I've seen show that posits achieve generally higher precision at the same number of bits compared to IEEE floats. You don't even necessarily need to care about scaling your inputs to be close to ±1; most problem domains already naturally have this!

The "quire" accumulator also enables new kinds of algorithms that are just not possible with IEEE (technically one can have a fixed-point accumulator for IEEE floats, but it's not as practical).

jcranmer•about 1 hour ago

See e.g. https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinech...

SideQuark•about 7 hours ago

> which avoids many of the flaws with IEEE floats

... by repeating lots of the flaws that led to IEEE754, requiring extra accumulators (the "quire") and hardware to do basic ops since the posit format alone fails, and making numerical analysis a complete mess, breaking the ability to write correct numerical algorithms.

They lose precision over large dynamic ranges, making algorithms fail on many inputs, without extreme care (and loss of accuracy over such ranges), lack of NaN/inf makes them fail on lots of other issues (and there are algorithms requiring NaN and inf behavior under IEEE754 for performance - I'll list one I recently made below...), this lack makes it harder to debug where algorithms broke, costing development time, ....

The algo I recently developed needed to find extrema of cubics over a finite range. This requires solving a quadratic. A quadratic root solver can have /0 = inf and sqrt(-) = NaN cases, which are often fiddled with using branches.

In my case I knew I'd be doing these in batches, and wanted C/C++ code to auto vectorize and do them in SIMD, and did not want to pay the cost for branches. This speed up the flow by about 8x on almost all larger processors, at the cost of some slots having NaN or inf. Those with NaN or inf had underlying cubics I could discard. So by using the IEEE754 aware multi parallel root finder (written in strd c++), I could check that the roots were in my interval (also parallelized) as a <= root <= b, which fails for root being NaN or inf. This check is also parallelzied.

All in standard C++, no hint of parallelization intrinsics, handled by modern compilers perfectly, and getting massive speed gains.

This is but one place NaN and inf are extremely useful. This type of use appears all over in scientific computing, graphics, pysics sims, etc.

Posits cannot handle this type of stuff.

dadoum•about 6 hours ago

Correct me if I am wrong, but your algorithm looks to me like it would work fine with posits? You would just get a NaR value instead of both NaN and inf, because div-by-zero and sqrt(-) both yield NaR (the difference here with floats is that it is unique and can get compared to NaR), and so it just works fine?

andrepd•about 4 hours ago

To be honest I'm a bit confused by your comments but the parts I can understand show some misunderstandings.

> lack of NaN/inf makes them fail on lots of other issues

You comment repeats several times assertions of the form "fails on lots of stuff", but does not really elaborate further x) But that aside.

Posits do have a NaN value, with the difference being that only 1 bit pattern is reserved to it instead of potentially quadrillions of them. This makes a huge difference especially at lower precisions, where NaN bit patterns waste an appreciable amount of the total set of values. They do not have inf, they simply do not overflow.

> In my case I knew I'd be doing these in batches, and wanted C/C++ code to auto vectorize and do them in SIMD, and did not want to pay the cost for branches. This speed up the flow by about 8x on almost all larger processors, at the cost of some slots having NaN or inf. Those with NaN or inf had underlying cubics I could discard.

Can you point to your code? I'd be very very surprised if it somehow cannot be rendered using posits.

Pathological cases do exist, of course. Nobody claims that posits are uniformly better than IEEE floats, such a claim being obviously, trivially nonsensical.

> All in standard C++, no hint of parallelization intrinsics, handled by modern compilers perfectly, and getting massive speed gains

Speed gains over what? x) I don't get it. Over branchy or scalar IEEE code? If yes, what's the relevance of that?

And yes, a 40-year old format with decades of hardware and compiler support works better in practice than any conceivable alternative... because all such alternatives are proof-of-concepts with no hardware or compiler support! There's no commercial hardware of any kind for posits, let alone SIMD. I really fail to see your point here.