Discussion (7 Comments)
https://stackoverflow.com/questions/33902068/what-setup-does...
Non-temporal stores are tricky performance-wise. They can be dramatically faster than normal stores (~3x), they may be faster on some generations of CPUs than others, they may be slower if subsequent code needs the destination in the CPU cache, and even for GPUs they may not be ideal if an iGPU is sharing part of the cache hierarchy with the CPU. But the worst issue is that occasionally a specific CPU will have some random pathological behavior with them. IIRC, masked non-temporal stores were horrifically slow on some AMD APUs, on the order of hundreds to thousands of cycles per instruction. I find it hard to recommend them much anymore.
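For anyone who hasn't used them, here is a rough sketch of what a non-temporal copy looks like with SSE2 intrinsics. `stream_copy` is a made-up name, the 16-byte alignment and size requirements are assumptions of this sketch, and the `memcpy` fallback is only there so it builds on non-SSE2 targets. Whether it beats a plain `memcpy` on a given CPU is exactly the thing you'd have to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: copy n bytes into a 16-byte-aligned dst using
 * non-temporal stores where SSE2 is available.
 * Assumptions of this sketch: dst is 16-byte aligned, n is a multiple of 16. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && (n % 16) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i) {
        __m128i v = _mm_loadu_si128(s + i);  /* ordinary (cached) unaligned load */
        _mm_stream_si128(d + i, v);          /* non-temporal store (MOVNTDQ) */
    }
    _mm_sfence();  /* order the weakly ordered streaming stores before later stores */
#else
    memcpy(dst, src, n);  /* portable fallback, no cache hint */
#endif
}
```

If the code that runs next reads `dst` right away, the regular store path may well be the better choice, since the streamed data is not expected to be in cache.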
You've got this monster of an instruction and then people place all this paranoid slowness around it. Am I reading the x86 manual wrong?
But on any modern CPU there should be essentially no penalty for doing that now. Testing the full register is basically free as long as you aren't doing a partial write followed by a full read (write AH then read AX), and I don't think there's any case where this could stall on anything newer than a Core 2 era processor. But just replacing that with a "jnc" or whatever exactly you're trying to test for would be fewer instructions at least. I'd love to see benchmarks though if someone has dug deeper into this than I have.
But yeah, it may not make a real impact yet anyway.
it is never used with a prefix (the value would be overwritten for each repetition)
...which is still useful for extreme size-optimisation; I remember seeing "rep lodsb" in a demo, as a slower-but-tiny (2 bytes) way of [1] adding cx to si, [2] zeroing cx, [3] putting the byte at [cx + si - 1] into al, and [4] conditionally leaving al and si unchanged if cx is 0, all effectively as a single instruction. Not something any optimising compiler I know of would be able to do, but perhaps within the possibility of an LLM these days.
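To make the four effects concrete, here is a small C model of what `rep lodsb` does to the registers. It is only a sketch: the `struct regs` layout and the `rep_lodsb` function are invented for illustration, the direction flag is assumed clear, and segmentation is ignored (`mem` stands in for the DS segment).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register model: 16-bit mode, DF assumed clear. */
struct regs {
    uint16_t si;
    uint16_t cx;
    uint8_t  al;
};

/* Model of `rep lodsb` (encoded as F3 AC, two bytes):
 * repeat "al = mem[si]; si++; cx--" until cx reaches 0.
 * If cx is already 0, nothing happens at all. */
static void rep_lodsb(struct regs *r, const uint8_t *mem)
{
    assert(r != 0 && mem != 0);
    while (r->cx != 0) {
        r->al = mem[r->si];  /* each iteration overwrites al */
        r->si++;
        r->cx--;
    }
}
```

Starting from si = 2 and cx = 5, the model ends with si = 7, cx = 0, and al holding the byte at offset 6, i.e. the original cx + si - 1. With cx = 0 on entry, al and si come out unchanged.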