XOR'ing a register with itself is the idiom for zeroing it out. Why not sub?

iingve about 4 hours ago 68 commentsRead Article on devblogs.microsoft.com

DE version is available. Content is displayed in original English for accuracy.

⚡ Community Insights

Discussion Sentiment

80% Positive

Analyzed from 2482 words in the discussion.

Discussion (68 Comments)Read Original on HackerNews

Sweepi•about 3 hours ago

"Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination."

Will remember for the next time I write asm for Itanium!

shawn_w•about 3 hours ago

Quite a few architectures have a dedicated 0 register.

repelsteeltje•about 2 hours ago

Yep. The XOR trick - relying on special use of opcode rather than special register - is probably related to limited number of (general purpose) registers in typical '70 era CPU design (8080, 6502, Z80, 8086).

signa11•about 2 hours ago

indeed. riscv for instance. also, afaik, xor’ing is faster. i would assume that someone like mr. raymond would know…

pif•about 2 hours ago

Which part of "mathematical operations don’t reset the NaT bit" did you not understand?

IshKebab•about 2 hours ago

> afaik, xor’ing is faster

Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On super-scalar CPUs xor and sub are normally issued to the same execution units so it wouldn't make a difference there either.

lynguist•about 2 hours ago

Indeed!!

MIPS - $zero

RISC-V - x0

SPARC - %g0

ARM64 - XZR

NewCzech•about 3 hours ago

The obvious answer is that XOR is faster. To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit. In XOR you don't have to do that because the output of every bit is independent of the other adjacent bits.

Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster.

Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?

arka2147483647•about 3 hours ago

> The answer is so obvious

A tangent, but what is Obvious depends on what you know.

Often experts don't explain the things they think are Obvious, but those things are only Obvious to them, because they are the expert.

We should all kind, and explain also the Obvious things those who do not know.

akie•about 3 hours ago

"The proof is left as an exercise for the reader" comes to mind

svnt•about 3 hours ago

His point is that in x86 there is no performance difference but everyone except his colleague/friend uses xor, while sub actually leaves cleaner flags behind. So he suspects its some kind of social convention selected at random and then propagated via spurious arguments in support (or that it “looks cooler” as a bit of a term of art).

It could also be as a result of most people working in assembly being aware of the properties of logic gates, so they carry the understanding that under the hood it might somehow be better.

3form•about 3 hours ago

I think an even more likely explanation would be that x86 assembly programmers often were, or learned from other-architecture assembly programmers. Maybe there's a place where it makes more sense and it can be so attributed. 6502 and 68k being first places I would look at.

richrichardsson•about 2 hours ago

For 68k depending on the size you're interested in then it mostly doesn't matter.

.b and .w -> clr eor sub are all identical

for .l moveq #0 is the winner

flohofwoe•about 3 hours ago

That comment is not very useful without pointing to realworld CPUs where SUB is more expensive than XOR ;)

E.g. on Z80 and 6502 both have the same cycle count.

brigade•about 2 hours ago

Cortex A8 vsub reads the second source register a cycle earlier than veor, so that can add one cycle latency

Not scalar, but still sub vs xor. Though you’d use vmov immediate for zeroing anyway.

GoblinSlayer•about 1 hour ago

Harvard Mark I? Not sure why people think programming started with Z80.

Tepix•about 3 hours ago

From TFA:

> It encodes to the same number of bytes, executes in the same number of cycles.

abainbridge•about 1 hour ago

Those aren't the only resources. I could imagine XOR takes less energy because using it might activate less circuitry than SUB.

adrian_b•21 minutes ago

XOR is faster when you do that alone in an FPGA or in an ASIC.

When you do XOR together with many other operations in an ALU (arithmetic-logical unit), the speed is determined by the slowest operation, so the speed of any faster operation does not matter.

This means that in almost all CPUs XOR and addition and subtraction have the same speed, despite the fact that XOR could be done faster.

In a modern pipelined CPU, the clock frequency is normally chosen so that a 64-bit addition can be done in 1 clock cycle, when including all the overheads caused by registers, multiplexers and other circuitry outside the ALU stages.

Operations more complex than 64-bit addition/subtraction have a latency greater than 1 clock cycle, even if one such operation can be initiated every clock cycle in one of the execution pipelines.

The operations less complex than 64-bit addition/subtraction, like XOR, are still executed in 1 clock cycle, so they do not have any speed advantage.

There have existed so-called superpipelined CPUs, where the clock frequency is increased, so that even addition/subtraction has a latency of 2 or more clock cycles.

Only in superpipelined CPUs it would be possible to have a XOR instruction that is faster than subtraction, but I do not know if this has ever been implemented in a real superpipelined CPU, because it could complicate the execution pipeline for negligible performance improvements.

Initially superpipelining was promoted by DEC as a supposedly better alternative to the superscalar processors promoted by IBM. However, later superpipelining was abandoned, because the superscalar approach provides better energy efficiency for the same performance. (I.e. even if for a few years it was thought that a Speed Demon beats a Brainiac, eventually it was proven that a Brainiac beats a Speed Demon, like shown in the Apple CPUs)

While mainstream CPUs do not use superpipelining, there have been some relatively recent IBM POWER CPUs that were superpipelined, but for a different reason than originally proposed. Those POWER CPUs were intended for having good performance only in multi-threaded workloads when using SMT, and not in single-thread applications. So by running simultaneous threads on the same ALU the multi-cycle latency of addition/subtraction was masked. This technique allowed IBM a simpler implementation of a CPU intended to run at 5 GHz or more, by degrading only the single-thread performance, without affecting the SMT performance. Because this would not have provided any advantage when using SMT, I assume that in those POWER CPUs XOR was not made faster than addition, even if this would have theoretically been possible.

mikequinlan•about 3 hours ago

As TFA says, on x86 `sub eax, eax` encodes to the same number of bytes and executes in the same number of cycles.

whizzter•about 2 hours ago

On modern ones, x86 has quite a history and the idiom might carry on from an even older machine.

Edit: Looked at comments, seems like x86 and the major 8bit cpu's had the same speed, pondering in this might be a remnant from the 4-bit ALU times.

abainbridge•42 minutes ago

> seems like x86 and the major 8bit cpu's had the same speed, pondering in this might be a remnant from the 4-bit ALU times.

I think that era of CPUs used a single circuit capable of doing add, sub, xor etc. They'd have 8 of them and the signals propagate through them in a row. I think this page explains the situation on the 6502: https://c74project.com/card-b-alu-cu/

And this one for the ARM 1: https://daveshacks.blogspot.com/2015/12/inside-alu-of-armv1-...

But I'm a software engineer speculating about how hardware works. You might want to ask a hardware engineer instead.

phire•about 3 hours ago

I'm not actually aware of any CPUs that preform a XOR faster than a SUB. And more importantly, they have identical timings on the 8086, which is where this pattern comes from.

billpg•about 3 hours ago

I had a similar reaction when learning 8086 assembly and finding the correct way to do `if x==y` was a CMP instruction which performed a subtraction and set only the flags. (The book had a section with all the branch instructions to use for a variety of comparison operators.) I think I spent a few minutes experimenting with XOR to see if I could fashion a compare-two-values-and-branch macro that avoided any subtraction.

virexene•about 3 hours ago

The operation is slightly more complex yes, but has there ever been an x86 CPU where SUB or XOR takes more than a single CPU cycle?

praptak•about 3 hours ago

I wonder if you could measure the difference in power consumption.

I mean, not for zeroing because we know from the TFA that it's special-cased anyway. But maybe if you test on different registers?

feverzsj•about 3 hours ago

It's like 0.5 cycles vs 0.9 cycles. So both are 1 cycle, considering synchronization.

pishpash•about 2 hours ago

But energy consumption could be different for this hypothetical 0.5 and 0.9.

scheme271•about 2 hours ago

Energy consumption wasn't really a concern when the idiom developed. I don't think people really cared about the energy consumption of instructions until well into the x86-64 era.

defmacr0•about 2 hours ago

I would be surprised if modern CPUs didn't decode "xor eax, eax" into a set of micro-ops that simply moves from an externally invisible dedicated 0 register. These days the x86 ISA is more of an API contract than an actual representation of what the hardware internals do.

defrost•about 2 hours ago

From TFA:

  The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”.

brigade•about 2 hours ago

Zero micro ops to be precise, that’s handled entirely at the register rename stage with no data movement.

themafia•about 3 hours ago

XOR and SUB have had identical cycle counts and latencies since the 8088. That's because you can "look ahead" when doing carries in binary. It's just a matter of how much floorspace on the chip you want to use.

https://en.wikipedia.org/wiki/Carry-lookahead_adder

The only minor difference between the two on x86, really, is SUB sets OF and CF according to the result while XOR always clears them.

asQuirreL•about 2 hours ago

A carry lookahead adder makes your circuit depth logarithmic in the width of the inputs vs linear for a ripple carry adder, but that is still asymptotically worse than XORs constant depth.

(But this does not discount the fact that basically all CPUs treat them both as one cycle)

jojobas•about 3 hours ago

The non-obvious bit is why there isn't an even faster and shorter "mov <register>,0" instructions - the processors started short-circuiting xor <register>,<register> much later.

bahmboo•about 2 hours ago

Because he is explicitly talking about x86 - maybe you missed that.

nopurpose•about 3 hours ago

It amazes me how entertaining Raymond's writing on most mundane aspects of computing often is.

drfuchs•about 2 hours ago

Relatedly, there's a steganographic opportunity to hide info in machine code by using "XOR rax,rax" for a "zero" and "SUB rax,rax" for a "one" in your executable. Shouldn't be too hard to add a compiler feature to allow you to specify the string you want encoded into its output.

adrian_b•43 minutes ago

It should be noted that XOR is just (bitwise) subtraction modulo 2.

There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8.

To produce a null result, any kind of subtraction can be used, and XOR is just a particular case of subtraction, it is not a different kind of operation.

Unlike for bigger moduli, when operations are done modulo 2 addition and subtraction are the same, so XOR can be used for either addition modulo 2 or subtraction modulo 2.

b1temy•about 2 hours ago

Back when I was in university, one of the units touching Assembly[0] required students to use subtraction to zero out the register instead of using the move instruction (which also worked), as it used fewer cycles.

I looked it up afterwards and xor was also a valid instruction in that architecture to zero out a register, and used even fewer cycles than the subtraction method; but it was not listed in the subset of the assembly language instructions we were allowed to use for that unit. I suspect that it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation was (if you didn't already learn about it in other units), when the unit was about something else entirely- but everyone knows what subtraction is, and that subtracting a number by itself leads to zero.

[0] Not x86, I do not recall the exact architecture.

enduku•about 2 hours ago

I ran into this rabbithole while writing an x86-64 asm rewriter.

xor was the default zeroing idiom.I onkly did sub reg,reg when I actually want its flags result. Otherwise the main rule is: do not touch either form unless flags liveness makes the rewrite obviously safe. Had about 40 such idioms for the passes.

tliltocatl•about 3 hours ago

It might be because XOR is rarely (in terms of static count, dynamically it surely appears a lot in some hot loops) used for anything else, so it is easier to spot and identify as "special" if you are writing manual assembly.

stingraycharles•about 3 hours ago

And helps with SMT

Edit: this is apparently not the case, see @tliltocatl's comment down the thread

tliltocatl•about 3 hours ago

What's SMT in this context?

recursivecaveat•about 2 hours ago

Simultaneous Multi-Threading (hyper-threading as Intel calls it). I'm not a cpu guy, but I think the ALU used for subtraction would be a more valuable resource to leave available to the other thread than whatever implements a xor. Hence you prefer to use the xor for zeroing and conserve the ALU for other threads to use.

kunley•about 2 hours ago

XOR appears a lot in any code touching encryption.

PS. What is static vs dynamic count?

tliltocatl•about 2 hours ago

Static count - how many times an instruction appears in a binary (or assembly source).

Dynamic count - how many times an opcode gets executed.

I. e. an instruction that doesn't appear often in code, but comes up in some hot loops (like encryption) would have low static and high dynamic.

empiricus•about 3 hours ago

The hw implementation of xor is simpler than sub, so it should consume slightly less energy. Wondering how much energy was saved in the whole world by using xor instead of sub.

flohofwoe•about 2 hours ago

I doubt any of that is measurable, since all ALU operations are usually implemented with the same logic (e.g. see https://www.righto.com/2013/09/the-z-80-has-4-bit-alu-heres-...)

croes•about 1 hour ago

I guess everything what was saved was burned by the first useless image created per AI

anematode•about 3 hours ago

My favorite (admittedly not super useful) trick in this domain is that sbb eax, eax breaks the dependency on the previous value of eax (just like xor and sub) and only depends on the carry flag. arm64 is less obtuse and just gives you csetm (special case of csinv) for this purpose.

defrost•about 3 hours ago

  Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.

And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars .. and why left handed sugar is perfect for diet sodas.

praptak•about 2 hours ago

You still need to explain why this case creates a positive feedback loop rather than a negative one. I mean left/right fuel intakes in cars and male/female ratios somehow tend to balance at 50/50.

ben_w•40 minutes ago

Regarding gender ratios: https://en.wikipedia.org/wiki/Fisher's_principle

There's exceptions, but they tend to be colonial animals in the broadest sense e.g. how clownfish males are famously able to become female but each group has one breeding male and one breeding female at any given time*, or bees where the males (drones) are functionally flying sperm and there's only one fertile female in any given colony; or some reptiles which have a temperature-dependent sex determination that may have been 50/50 before we started causing rapid climate change but in many cases isn't now: https://en.wikipedia.org/wiki/Temperature-dependent_sex_dete...

* Wolves, despite being where nomenclature of "alpha" comes from, are not this. The researcher who coined the term realised they made a mistake and what he thought of as the "alpha" pair were simply the parents of the others in that specific situation: https://davemech.org/wolf-news-and-information/

phenol•about 1 hour ago

products of an asymmetric reaction performed without enantiomeric control can selectively catalyse the formation of more products with the same handedness -- this is called autocatalysis. so the first full reaction might produce a left-handed product (by chance) but that left-handed product will then cause future products to be preferentially left-handed. see the [Soai reaction](https://en.wikipedia.org/wiki/Soai_reaction?wprov=sfla1) for an example of this.

as mentioned by others this is conjectural but it is a popular (if somewhat unfalsifiable) explanation for homochirality

defrost•about 2 hours ago

Wrt amino acids and sugars I personally don't have to explain as a good many others have already.

eg: For one, Isaac Asimov in the 1970s wrote at length on this in his role as a non fiction science writer with a Chemistry Phd

> male/female ratios somehow tend to balance at 50/50.

This is different to the case of actual right handed dominance in humans and to L- Vs R- dominance in chirality ...

( Men and women aren't actual mirror images of each other ... )

JuniperMesos•about 2 hours ago

This is a hypothesis about why the chirality of life on earth is what it is, but I don't think there's enough evidence to state that this (or any competing hypothesis) is definitely the correct explanation.

defrost•about 2 hours ago

Well "definitely correct" has no real place in probabilistic arguments almost by ipso factum absurdum :-)

The chirality argument made is more akin to dynamic systems balance; yes, you can balance a pencil on its point .. but given a bit of random tilt one way or the other it's going to tend to keep going and end near flat on the table.

jhoechtl•about 2 hours ago

Back in the stone ages XOR ing was just 1 byte of opcode. Habbits stick. In effect XORing is no longer faster since a long time.

dragontamer•about 2 hours ago

The XOR trick is implemented as a (malloc from register file) on modern processors, implemented in the decoder and it won't even issue a uOp to the execution pipelines.

Its basically free today. Of course, mov RAX, 0 is also free and does the same thing. But CPUs have limited decoder lengths per clock tick, so the more instructions you fit in a given size, the more parallel a modern CPU can potentially execute.

So.... definitely still use XOR trick today. But really, let the compiler handle it. Its pretty good at keeping track of these things in practice.

-----------

I'm not sure if "sub" is hard-coded to be recognized in the decoder as a zero'd out allocation from the register file. There's only certain instructions that have been guaranteed to do this by Intel/AMD.

flohofwoe•about 2 hours ago

Depending on what's stone-age for you, a SUB with a register was also only one byte, and was the same cost as XOR, at least in the Intel/Zilog lineage all the way back to the 70s ;)

rasz•about 3 hours ago

Looking at some random 1989 Zenith 386SX bios written in assembly so purely programmer preferences:

8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'

26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'

edit: checked a 2010 bios and not a single 'sub x, x'

pishpash•about 2 hours ago

Could be used to express 1 bit of information in some non-obvious convention.

grebc•about 3 hours ago

If you’re not first, you’re last.