
Discussion (21 Comments)
It seems that life is imitating art.
https://github.com/sdd/ieee754-rrp
That's fun.
> It seems that life is imitating art.
You didn't even beat Wikipedia to the punch. They've had a nice page about minifloats using 6-8 bit sizes as examples for about 20 years.
The 4 bit section is newer, but it actually follows IEEE rules. Your joke formats forgot there's an implied 1 bit in the fraction. And how exponents work.
Someone didn't try it on GPU...
If you're in a numerics-heavy use case, that's a massive difference. It's not some outdated "Ancient Lore" that causes languages that care about performance to default to fp32 :P
Not completely - for basic operations (and ignoring byte size for things like cache hit ratios and memory bandwidth), if you look at, say, Agner Fog's optimisation PDFs of instruction latency, the basic SSE/AVX latency for add/sub/mul/div (yes, even divides these days) is almost always the same between float and double on the most recent AMD/Intel CPUs (and normally the execution ports can do both now).
Where it differs is gather/scatter and some shuffle instructions (larger sizes to work on), and maths routines like transcendentals - sqrt(), sin(), etc. - where the backing algorithms (whether on the processor in some cases, or in libm or equivalent) obviously have to do more work (often more iterations of refinement) to calculate the value to the greater precision of f64.
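A quick way to sanity-check the latency claim is a serially dependent add chain, where each iteration must wait for the previous result, so the loop time is dominated by add latency rather than throughput. This is only a rough sketch (the iteration count and the clock()-based timing are arbitrary choices, not a rigorous benchmark), but on recent x86 cores the two loops tend to take about the same time, matching the latency tables:

    /* Rough latency sketch: each add depends on the previous one.
     * Build with e.g.: cc -O2 chain.c -o chain (no -ffast-math, so the
     * compiler cannot reassociate or vectorize the chain). */
    #include <stdio.h>
    #include <time.h>

    #define N 200000000L

    static double chain_f64(void) {
        volatile double seed = 1.0;  /* volatile: stop constant folding */
        double x = seed;
        for (long i = 0; i < N; i++)
            x += 1e-9;               /* serially dependent f64 adds */
        return x;
    }

    static float chain_f32(void) {
        volatile float seed = 1.0f;
        float x = seed;
        for (long i = 0; i < N; i++)
            x += 1e-9f;              /* serially dependent f32 adds */
        return x;
    }

    int main(void) {
        clock_t t0 = clock();
        double d = chain_f64();
        clock_t t1 = clock();
        float f = chain_f32();
        clock_t t2 = clock();
        printf("f64 chain: %.2fs (x=%g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, d);
        printf("f32 chain: %.2fs (x=%g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, f);
        return 0;
    }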
If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with a noticeable performance difference.
https://gcc.godbolt.org/z/7155YKTrK
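As a minimal illustration of that split (the specific target here is my assumption - e.g. a Cortex-M4F built with -mcpu=cortex-m4 -mfpu=fpv4-sp-d16, whose FPU is single-precision only - and may not match exactly what the link shows):

    /* On a core with a single-precision-only FPU, mul_f32 compiles to a
     * single hardware vmul.f32, while mul_f64 falls back to a soft-float
     * runtime call (an __aeabi_dmul-style helper), many times slower. */
    float  mul_f32(float a, float b)   { return a * b; }
    double mul_f64(double a, double b) { return a * b; }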
That... doesn't seem true? At least for most architectures I looked at?
While it's true that ADDPS and ADDPD have the same latency (using the zen4 example at least), the double variant only calculates 4 fp64 values compared to the single-precision's 8 fp32. Which was my point: if each double-precision instruction processes half as many inputs, it would need correspondingly lower latency to keep the same per-element rate.
And DIV also has a significantly lower throughput for fp64 vs fp32 on zen4 - 5 clk/op vs 3 - while also processing half the values?
Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has lower throughput) - but then you're already leaving significant peak flops on the table, so I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.
So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.
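To make the 2:1 width point concrete, here's a tiny AVX sketch (assuming an AVX-capable x86 CPU; compile with something like cc -O2 -mavx): one 256-bit add touches 8 fp32 elements but only 4 fp64 elements, so identical instruction timings still mean half the element throughput in double precision.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256  ps = _mm256_set1_ps(1.5f);  /* one register = 8 fp32 lanes */
        __m256d pd = _mm256_set1_pd(1.5);   /* one register = 4 fp64 lanes */

        ps = _mm256_add_ps(ps, ps);         /* one VADDPS does 8 adds */
        pd = _mm256_add_pd(pd, pd);         /* one VADDPD does 4 adds */

        float  out_f[8];
        double out_d[4];
        _mm256_storeu_ps(out_f, ps);
        _mm256_storeu_pd(out_d, pd);
        printf("fp32: %zu lanes per 256-bit op, lane0 = %g\n",
               sizeof(__m256) / sizeof(float), out_f[0]);
        printf("fp64: %zu lanes per 256-bit op, lane0 = %g\n",
               sizeof(__m256d) / sizeof(double), out_d[0]);
        return 0;
    }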
What do you mean by this? In C, 1.0 is a double.
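You can get the compiler to confirm that itself with C11's _Generic - just a quick demonstration sketch:

    /* An unsuffixed floating constant has type double in C; the f suffix
     * makes it float, L makes it long double. */
    #include <stdio.h>

    #define TYPE_NAME(x) _Generic((x),            \
        float:       "float",                     \
        double:      "double",                    \
        long double: "long double",               \
        default:     "other")

    int main(void) {
        printf("1.0  -> %s, %zu bytes\n", TYPE_NAME(1.0),  sizeof(1.0));
        printf("1.0f -> %s, %zu bytes\n", TYPE_NAME(1.0f), sizeof(1.0f));
        printf("1.0L -> %s, %zu bytes\n", TYPE_NAME(1.0L), sizeof(1.0L));
        return 0;
    }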
> The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.
[0] https://en.wikipedia.org/wiki/Minifloat
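A small sketch that walks that 4-bit layout through the standard IEEE-754 decoding rules: bias = 2^(2-1) - 1 = 1, implicit leading 1 for normals, zero exponent field for subnormals, all-ones exponent for infinity/NaN (the function name is just for illustration):

    #include <math.h>
    #include <stdio.h>

    static double decode_fp4(unsigned bits) {
        unsigned s = (bits >> 3) & 1;   /* sign bit      */
        unsigned e = (bits >> 1) & 3;   /* exponent bits */
        unsigned m =  bits       & 1;   /* mantissa bit  */
        double sign = s ? -1.0 : 1.0;
        if (e == 3)                      /* all-ones exponent */
            return m ? NAN : sign * INFINITY;
        if (e == 0)                      /* subnormal: no implicit 1 */
            return sign * (m / 2.0);
        /* normal: (1 + m/2) * 2^(e - bias), bias = 1 */
        return sign * (1.0 + m / 2.0) * (double)(1u << (e - 1));
    }

    int main(void) {
        for (unsigned b = 0; b < 16; b++)
            printf("0b%u%u%u%u -> %g\n",
                   (b >> 3) & 1, (b >> 2) & 1, (b >> 1) & 1, b & 1,
                   decode_fp4(b));
        return 0;
    }

This enumerates all 16 values: ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±infinity, and two NaN encodings - which is exactly why this is the smallest format that still exercises every IEEE feature.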
I thought in ancient times, floating point numbers used to be 80 bits. They lived in a funky mini-stack on the coprocessor (x87). Then one day, somebody came along and standardized those 32- and 64-bit floats we still have today.
One of the striking things, though, is just how radically different KCS was. By the time IEEE 754 forms, there is a basic commonality of how floating-point numbers work. Most systems have a single-precision and double-precision form, and many have an additional extended-precision form. These formats are usually radix-2, with a sign bit, a biased exponent, and an integer mantissa, and several implementations had hit on the implicit integer bit representation. (See http://www.quadibloc.com/comp/cp0201.htm for a tour of several pre-IEEE 754 floating-point formats). What KCS did that was really new was add denormals, and this was very controversial. I also think that support for infinities was introduced with KCS, although there were more precedents for the existence of NaN-like values. I'm also pretty sure that sticky bits as opposed to trapping for exceptions was considered innovative. (See, e.g., https://ethw-images.s3.us-east-va.perf.cloud.ovh.us/ieee/f/f... for a discussion of the differences between the early drafts.)
Now, once IEEE 754 came out, pretty much every subsequent implementation of floating-point has started from the IEEE 754 standard. But it was definitely not a codification of existing behavior when it came out, given the number of innovations that it had!
S:E:l:M
S = sign bits (1 if a sign bit is present, 0 for a magnitude-only format)
E = exponent bits (typically biased by 2^(E-1) - 1)
l = explicit leading integer bits (almost always 0, because the leading digit is implied: always 1 for normals, 0 for denormals, and not very useful for special values)
M = mantissa (fraction) bits
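Worked instances in this notation: IEEE binary16 is 1:5:0:10, binary32 is 1:8:0:23, binary64 is 1:11:0:52, and the x87 80-bit extended format is 1:15:1:63 - one of the few widely used formats with an explicit leading integer bit.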
The limitations of FP4 - it lacks infinities, [sq]NaNs, and denormals - make it very limited, for special purposes only. There's no denying that it might be extremely efficient for very particular problems, though.
If a more even distribution were needed, a simpler fixed-point format like 1:2:1 (sign:integer:fraction bits) is possible.
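To spell that out: with 1 sign, 2 integer, and 1 fraction bit, the step is a uniform 0.5, so the 16 patterns cover ±0, ±0.5, ±1.0, ... ±3.5 - whereas the minifloat above clusters its values near zero and spends the top of its range on infinity and NaNs.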