103 points vortex_ape | 1 comments
comex ◴[] No.42742950[source]
Incidentally, this automatic branch-if-zero from LLVM is being improved.

First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:

https://github.com/llvm/llvm-project/pull/102885

Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).

If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSF with the same register as the destination. (BSF, not BSR, is the scan that yields the trailing-zero count; BSR finds the highest set bit.) If the BSF input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSF input is zero, BSF leaves the register unmodified and you get 32.
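A minimal sketch of that trick, assuming GCC or Clang inline assembly on x86. The function name `cttz32_passthrough` is made up for illustration, and the x86 path relies on the "destination unmodified on zero input" behavior described above, so treat it as an experiment rather than something a compiler is known to emit. BSF is used because it is the scan that produces the trailing-zero count. A plain loop stands in on other targets.

```c
#include <stdint.h>

/* Hypothetical helper: trailing-zero count of x, or 32 when x == 0. */
static inline uint32_t cttz32_passthrough(uint32_t x) {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    /* Preload the "zero input" answer into the destination register. */
    uint32_t result = 32;
    /* "+r" marks result as both read and written, so the compiler must
       materialize the 32 before the instruction. Assumption: BSF leaves
       the destination untouched when the source is zero. */
    __asm__("bsfl %1, %0" : "+r"(result) : "rm"(x) : "cc");
    return result;
#else
    /* Portable fallback with the same results, no special instruction. */
    uint32_t n = 0;
    if (x == 0) {
        return 32;
    }
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
#endif
}
```

The read-write constraint is the important part: with a write-only output (`"=r"`), the compiler would be free to drop the initial 32 and the zero case would return garbage.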

Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.

Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.

(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")

hinkley ◴[] No.42744209[source]
I was watching a video ranting about bad benchmarks yesterday, and in an aside they pointed out that the (gcc) generated code used Conditional Move (cmov) in several places to handle an if/else if in the code with no branches.

I think the days of trying to eliminate branches by removing conditional assignments are either gone or close to it. You may still have a subsequent data race, but the conditional assignment isn't your biggest problem with throughput.

1. achierius ◴[] No.42744883[source]
What makes you say that? I've seen several cases where an over-usage of branchless programming actually slowed things down. Especially once you get past 2 nested conditionals (so 4+ pathways) you do just end up executing a lot of ultimately-unused code. In fact this has been going the other direction, in some ways, for a little while now: people overestimate how much branches cost, particularly small, local, and easy-to-predict ones.
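To make the "4+ pathways" point concrete, here is a rough sketch with two nested conditions. Both function names and the four arithmetic arms are invented for illustration. The branchy version evaluates only the arm it needs, while the mask-based version evaluates all four and throws three away. What a compiler actually emits for either one depends on the target and optimization level.

```c
#include <stdint.h>

/* Branchy: only one of the four arms is evaluated per call. */
static uint32_t pick_branchy(uint32_t x, uint32_t y, int p, int q) {
    if (p) {
        if (q) {
            return x + y;
        }
        return x - y;
    }
    if (q) {
        return x ^ y;
    }
    return x & y;
}

/* Branchless: all four arms are evaluated, then masks select one. */
static uint32_t pick_branchless(uint32_t x, uint32_t y, int p, int q) {
    /* All-ones when the condition holds, all-zeros otherwise. */
    uint32_t mp = (uint32_t)0 - (uint32_t)(p != 0);
    uint32_t mq = (uint32_t)0 - (uint32_t)(q != 0);

    /* Work for every pathway is done up front. */
    uint32_t a = x + y;
    uint32_t b = x - y;
    uint32_t c = x ^ y;
    uint32_t d = x & y;

    /* Inner select on q, then outer select on p. */
    uint32_t when_p = (a & mq) | (b & ~mq);
    uint32_t when_not_p = (c & mq) | (d & ~mq);
    return (when_p & mp) | (when_not_p & ~mp);
}
```

With cheap arms like these the wasted work is small. It grows with the cost of each arm and with every additional level of nesting, which is where a well-predicted branch can come out ahead.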