Beautiful branchless binary search

(probablydance.com)

386 points ingve | 1 comments | 28 Apr 23 05:46 UTC


pizza234 ◴[28 Apr 23 06:53 UTC] No.35738299[source]▶

Some information on CMOV can be found in the Intel Optimization Reference Manual (https://cdrdv2-public.intel.com/671488/248966-046A-software-...).

Torvalds was famously critical of it (https://yarchive.net/comp/linux/cmov.html); part of the criticism is now moot though, due to low latencies on modern processors (it seems 1 cycle or less, although the instruction consumes internal flags).

His idea still seems applicable and consistent with Intel's recommendation: make branches predictable first, and only then use CMOV for the remaining ones. His fundamental assumption is that "even if you were to know that something is unpredictable, it's going to be very rare."
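To make the branch-versus-CMOV trade-off concrete, here is a minimal sketch of the same lower-bound search written two ways. It is not the implementation from the linked article; the function names are hypothetical, and whether the conditional select actually compiles to a CMOV depends on the compiler, flags, and target.

```cpp
#include <cstddef>
#include <vector>

// Conventional version: each iteration branches on the comparison,
// so the CPU has to predict which half of the range is taken.
std::size_t branchy_lower_bound(const std::vector<int>& v, int key) {
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

// Branch-free style: the iteration count depends only on the size,
// and the comparison result feeds a select rather than a jump.
// Assumption: the compiler chooses to lower the ternary to CMOV
// (or an equivalent); nothing in the language guarantees that.
std::size_t branchless_lower_bound(const std::vector<int>& v, int key) {
    const int* base = v.data();
    std::size_t len = v.size();
    while (len > 1) {
        std::size_t half = len / 2;
        // If the last element of the first half is still < key,
        // the answer lies in the second half.
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    std::size_t idx = static_cast<std::size_t>(base - v.data());
    // One element may remain; step past it if it is still < key.
    if (len == 1 && *base < key) {
        ++idx;
    }
    return idx;
}
```

Both functions return the index of the first element not less than `key` in a sorted vector. The second trades a hard-to-predict branch for a data dependency on every step, which is the trade Torvalds was arguing about.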

replies(3): >>35738345 #>>35741256 #>>35742105 #

2102922286 ◴[28 Apr 23 07:02 UTC] No.35738345[source]▶

>>35738299 #

This is one area where Profile-Guided Optimization (PGO) can help a lot! With PGO, you run your program on some sample input and it logs info like how many times each side of a branch was taken. From there, you can recompile your code. If the compiler sees that one side of the branch dominates, it can emit code to prioritize that side. However, if the branch counts are approximately even and the branch is hard to predict (n.b. this is technically distinct from having even branch counts), then the compiler can know that the CPU would have trouble predicting the branch, and can emit a cmov instruction instead.
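The distinction between even branch counts and an unpredictable branch can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the predictor just remembers which outcome last followed each previous outcome, which is far simpler than real hardware, and all names are hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct BranchStats {
    double taken_ratio;       // fraction of executions where the branch was taken
    double mispredict_ratio;  // fraction the toy predictor guessed wrong
};

// Toy predictor with one bit of history: for each possible previous
// outcome, remember what followed it last time and predict that again.
BranchStats analyze(const std::vector<bool>& outcomes) {
    std::size_t taken = 0, mispredicts = 0;
    bool last_after[2] = {false, false};
    bool prev = false;
    for (bool o : outcomes) {
        if (last_after[prev] != o) {
            ++mispredicts;
        }
        last_after[prev] = o;
        taken += o;
        prev = o;
    }
    double n = outcomes.empty() ? 1.0 : static_cast<double>(outcomes.size());
    return {taken / n, mispredicts / n};
}

// Taken, not taken, taken, ... : a 50/50 split that is fully regular.
std::vector<bool> alternating(std::size_t n) {
    std::vector<bool> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (i % 2 == 0);
    }
    return out;
}

// Pseudo-random outcomes from a fixed seed: also roughly 50/50,
// but with no pattern for the predictor to learn.
std::vector<bool> coin_flips(std::size_t n) {
    std::mt19937 rng(12345);
    std::vector<bool> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (rng() & 1u) != 0;
    }
    return out;
}
```

Calling `analyze(alternating(1000))` and `analyze(coin_flips(10000))` gives a taken ratio near one half in both cases, but only the coin flips produce a high misprediction ratio. Only the second kind of branch is the candidate for cmov that the comment describes.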

replies(4): >>35738413 #>>35739365 #>>35740748 #>>35745104 #

xmcqdpt2 ◴[28 Apr 23 13:00 UTC] No.35740748[source]▶

>>35738345 #

If your PGO program ends up in a situation which is the opposite of what you profiled, couldn't you end up with vastly worse performance than a non-optimized program?

I've always felt uncomfortable about PGO for more complex programs because of this.

replies(1): >>35745100 #

1. eklitzke ◴[28 Apr 23 18:17 UTC] No.35745100[source]▶

>>35740748 #

Right, this is why you should use AutoFDO nowadays, not PGO. With AutoFDO you occasionally record which branches were taken, using perf record on prod binaries (e.g. in prod this might be something like recording for 1s on average every 300s), and then feed this back to the compiler.
