
386 points ingve | 1 comments
pizza234 ◴[] No.35738299[source]
Some information on CMOV can be found in the Intel Optimization Reference Manual (https://cdrdv2-public.intel.com/671488/248966-046A-software-...).

Torvalds was famously critical of it (https://yarchive.net/comp/linux/cmov.html); part of the criticism is now moot, though, because latencies are low on modern processors (it seems 1 cycle or less, although the instruction consumes internal flags).

His idea still seems applicable and consistent with Intel's recommendation: first make branches predictable, and only then use CMOV for the remaining ones. His fundamental assumption is that "even if you were to know that something is unpredictable, it's going to be very rare."
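To make the trade-off concrete, here is a minimal C sketch of the same loop written with an explicit branch and with a conditional select. The function names are made up for illustration, and nothing in the C language guarantees the second form compiles to CMOV: on x86-64, optimizing compilers often lower a select like this to a conditional move, but they are free to pick either form for either function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum the elements greater than a threshold.
 * Branchy form: cheap when the comparison is predictable, costly
 * when it is not, because each mispredict flushes the pipeline. */
static int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold) {
            sum += v[i];
        }
    }
    return sum;
}

/* Select form: the value to add is always computed, so there is no
 * control dependency in the source. A compiler may (but need not)
 * turn the ternary into a conditional move. */
static int64_t sum_above_select(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t add = (v[i] > threshold) ? (int64_t)v[i] : 0;
        sum += add;
    }
    return sum;
}
```

Both functions return the same result; the only difference is which shape you hand to the compiler. Checking the generated assembly is the only way to know what you actually got.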

replies(3): >>35738345 #>>35741256 #>>35742105 #
2102922286 ◴[] No.35738345[source]
This is one area where Profile-Guided Optimization (PGO) can help a lot! With PGO, you run your program on some sample input and it logs info like how many times each side of a branch was taken. From there, you can recompile your code. If the compiler sees that one side of the branch dominates, it can emit code to prioritize that side. However, if the branch counts are approximately even and the branch is hard to predict (n.b. being hard to predict is technically distinct from having even branch counts), then the compiler can know that the CPU would have trouble predicting the branch, and can emit a cmov instruction instead.
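The n.b. above can be illustrated with a toy model. The sketch below is not how any real CPU or compiler works: it assumes a made-up predictor (2-bit saturating counters indexed by the last two outcomes) and two hypothetical branch traces that are both taken about half the time. The alternating trace is almost perfectly predictable, the pseudo-random one is not, even though a plain taken/not-taken count looks the same for both.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how often the branch was taken: the kind of number a
 * simple edge profile records. */
static size_t count_taken(const uint8_t *taken, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (taken[i] != 0);
    }
    return total;
}

/* Toy predictor (an assumption for illustration only): four 2-bit
 * saturating counters, selected by the previous two outcomes. */
static size_t count_mispredicts(const uint8_t *taken, size_t n) {
    uint8_t counters[4] = {1, 1, 1, 1}; /* start at weakly not-taken */
    unsigned history = 0;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t *c = &counters[history & 3u];
        int predicted = (*c >= 2);
        int actual = (taken[i] != 0);
        if (predicted != actual) {
            misses++;
        }
        /* Train the counter toward the actual outcome. */
        if (actual && *c < 3) {
            (*c)++;
        }
        if (!actual && *c > 0) {
            (*c)--;
        }
        history = (history << 1) | (unsigned)actual;
    }
    return misses;
}

/* Trace 1: not-taken, taken, not-taken, ... (50% taken, regular). */
static void fill_alternating(uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(i & 1u);
    }
}

/* Trace 2: roughly 50% taken with no short pattern, from xorshift32.
 * The seed must be nonzero. */
static void fill_pseudo_random(uint8_t *out, size_t n, uint32_t seed) {
    uint32_t x = seed;
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        out[i] = (uint8_t)((x >> 16) & 1u);
    }
}
```

Filling two buffers with these traces and comparing count_taken against count_mispredicts shows why even counts alone are not enough to decide that a branch deserves a cmov.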
replies(4): >>35738413 #>>35739365 #>>35740748 #>>35745104 #
xmcqdpt2 ◴[] No.35740748[source]
If your PGO-built program ends up in a situation that is the opposite of what you profiled, couldn't you end up with vastly worse performance than a non-optimized program?

I've always felt uncomfortable about PGO for more complex programs because of this.

replies(1): >>35745100 #
1. eklitzke ◴[] No.35745100[source]
Right, this is why you should use AutoFDO nowadays, not PGO. With AutoFDO you occasionally record which branches were taken, using perf record on prod binaries (e.g. in prod this might be something like recording for 1s on average every 300s), and then feed this back to the compiler.