DE version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
80% Positive
Analyzed from 2482 words in the discussion.
Trending Topics
#xor#sub#subtraction#bit#register#cpus#same#more#cycle#zero

Discussion (68 Comments)Read Original on HackerNews
Will remember for the next time I write asm for Itanium!
Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On super-scalar CPUs xor and sub are normally issued to the same execution units so it wouldn't make a difference there either.
MIPS - $zero
RISC-V - x0
SPARC - %g0
ARM64 - XZR
Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster.
Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?
A tangent, but what is Obvious depends on what you know.
Often experts don't explain the things they think are Obvious, but those things are only Obvious to them, because they are the expert.
We should all kind, and explain also the Obvious things those who do not know.
It could also be as a result of most people working in assembly being aware of the properties of logic gates, so they carry the understanding that under the hood it might somehow be better.
.b and .w -> clr eor sub are all identical
for .l moveq #0 is the winner
E.g. on Z80 and 6502 both have the same cycle count.
Not scalar, but still sub vs xor. Though you’d use vmov immediate for zeroing anyway.
> It encodes to the same number of bytes, executes in the same number of cycles.
When you do XOR together with many other operations in an ALU (arithmetic-logical unit), the speed is determined by the slowest operation, so the speed of any faster operation does not matter.
This means that in almost all CPUs XOR and addition and subtraction have the same speed, despite the fact that XOR could be done faster.
In a modern pipelined CPU, the clock frequency is normally chosen so that a 64-bit addition can be done in 1 clock cycle, when including all the overheads caused by registers, multiplexers and other circuitry outside the ALU stages.
Operations more complex than 64-bit addition/subtraction have a latency greater than 1 clock cycle, even if one such operation can be initiated every clock cycle in one of the execution pipelines.
The operations less complex than 64-bit addition/subtraction, like XOR, are still executed in 1 clock cycle, so they do not have any speed advantage.
There have existed so-called superpipelined CPUs, where the clock frequency is increased, so that even addition/subtraction has a latency of 2 or more clock cycles.
Only in superpipelined CPUs it would be possible to have a XOR instruction that is faster than subtraction, but I do not know if this has ever been implemented in a real superpipelined CPU, because it could complicate the execution pipeline for negligible performance improvements.
Initially superpipelining was promoted by DEC as a supposedly better alternative to the superscalar processors promoted by IBM. However, later superpipelining was abandoned, because the superscalar approach provides better energy efficiency for the same performance. (I.e. even if for a few years it was thought that a Speed Demon beats a Brainiac, eventually it was proven that a Brainiac beats a Speed Demon, like shown in the Apple CPUs)
While mainstream CPUs do not use superpipelining, there have been some relatively recent IBM POWER CPUs that were superpipelined, but for a different reason than originally proposed. Those POWER CPUs were intended for having good performance only in multi-threaded workloads when using SMT, and not in single-thread applications. So by running simultaneous threads on the same ALU the multi-cycle latency of addition/subtraction was masked. This technique allowed IBM a simpler implementation of a CPU intended to run at 5 GHz or more, by degrading only the single-thread performance, without affecting the SMT performance. Because this would not have provided any advantage when using SMT, I assume that in those POWER CPUs XOR was not made faster than addition, even if this would have theoretically been possible.
Edit: Looked at comments, seems like x86 and the major 8bit cpu's had the same speed, pondering in this might be a remnant from the 4-bit ALU times.
I think that era of CPUs used a single circuit capable of doing add, sub, xor etc. They'd have 8 of them and the signals propagate through them in a row. I think this page explains the situation on the 6502: https://c74project.com/card-b-alu-cu/
And this one for the ARM 1: https://daveshacks.blogspot.com/2015/12/inside-alu-of-armv1-...
But I'm a software engineer speculating about how hardware works. You might want to ask a hardware engineer instead.
I mean, not for zeroing because we know from the TFA that it's special-cased anyway. But maybe if you test on different registers?
https://en.wikipedia.org/wiki/Carry-lookahead_adder
The only minor difference between the two on x86, really, is SUB sets OF and CF according to the result while XOR always clears them.
(But this does not discount the fact that basically all CPUs treat them both as one cycle)
There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8.
To produce a null result, any kind of subtraction can be used, and XOR is just a particular case of subtraction, it is not a different kind of operation.
Unlike for bigger moduli, when operations are done modulo 2 addition and subtraction are the same, so XOR can be used for either addition modulo 2 or subtraction modulo 2.
I looked it up afterwards and xor was also a valid instruction in that architecture to zero out a register, and used even fewer cycles than the subtraction method; but it was not listed in the subset of the assembly language instructions we were allowed to use for that unit. I suspect that it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation was (if you didn't already learn about it in other units), when the unit was about something else entirely- but everyone knows what subtraction is, and that subtracting a number by itself leads to zero.
[0] Not x86, I do not recall the exact architecture.
xor was the default zeroing idiom.I onkly did sub reg,reg when I actually want its flags result. Otherwise the main rule is: do not touch either form unless flags liveness makes the rewrite obviously safe. Had about 40 such idioms for the passes.
Edit: this is apparently not the case, see @tliltocatl's comment down the thread
PS. What is static vs dynamic count?
Dynamic count - how many times an opcode gets executed.
I. e. an instruction that doesn't appear often in code, but comes up in some hot loops (like encryption) would have low static and high dynamic.
There's exceptions, but they tend to be colonial animals in the broadest sense e.g. how clownfish males are famously able to become female but each group has one breeding male and one breeding female at any given time*, or bees where the males (drones) are functionally flying sperm and there's only one fertile female in any given colony; or some reptiles which have a temperature-dependent sex determination that may have been 50/50 before we started causing rapid climate change but in many cases isn't now: https://en.wikipedia.org/wiki/Temperature-dependent_sex_dete...
* Wolves, despite being where nomenclature of "alpha" comes from, are not this. The researcher who coined the term realised they made a mistake and what he thought of as the "alpha" pair were simply the parents of the others in that specific situation: https://davemech.org/wolf-news-and-information/
as mentioned by others this is conjectural but it is a popular (if somewhat unfalsifiable) explanation for homochirality
eg: For one, Isaac Asimov in the 1970s wrote at length on this in his role as a non fiction science writer with a Chemistry Phd
> male/female ratios somehow tend to balance at 50/50.
This is different to the case of actual right handed dominance in humans and to L- Vs R- dominance in chirality ...
( Men and women aren't actual mirror images of each other ... )
The chirality argument made is more akin to dynamic systems balance; yes, you can balance a pencil on its point .. but given a bit of random tilt one way or the other it's going to tend to keep going and end near flat on the table.
Its basically free today. Of course, mov RAX, 0 is also free and does the same thing. But CPUs have limited decoder lengths per clock tick, so the more instructions you fit in a given size, the more parallel a modern CPU can potentially execute.
So.... definitely still use XOR trick today. But really, let the compiler handle it. Its pretty good at keeping track of these things in practice.
-----------
I'm not sure if "sub" is hard-coded to be recognized in the decoder as a zero'd out allocation from the register file. There's only certain instructions that have been guaranteed to do this by Intel/AMD.
8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'
26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'
edit: checked a 2010 bios and not a single 'sub x, x'