Beautiful branchless binary search

1. pizza234 ◴[28 Apr 23 06:53 UTC] No.35738299[source]▶

Some information on the CMOV can be found on the Intel Optimization Reference Manual (https://cdrdv2-public.intel.com/671488/248966-046A-software-...).

Torvalds was famously critical of it (https://yarchive.net/comp/linux/cmov.html); part of the criticism is now moot though, due to low latencies on modern processors (it seems 1 cycle or less, although the instruction consumes internal flags).

His idea seems to be still applicable and consistent with Intel's recommendation: make branches predictable, and only after, use CMOV for the remaining ones. His fundametnal assumption is that "even if you were to know that something is unpredictable, it's going to be very rare.".

replies(3): >>35738345 #>>35741256 #>>35742105 #

2. 2102922286 ◴[28 Apr 23 07:02 UTC] No.35738345[source]▶

>>35738299 (TP) #

This is one area where Profile-Guided Optimization (PGO) can help a lot! With PGO, you run your program on some sample input and it logs info like how many times each side of a branch was taken. From there, you can recompile your code. If the compiler sees that one side of the branch dominates, it can emit code to prioritize that branch. However if the branch counts are approximately even and the branch is hard to predict (n.b. this is technically distinct from having even branch counts), then the compiler can know that the CPU would have trouble predicting the branch, and can emit a cmov instruction instead.

replies(4): >>35738413 #>>35739365 #>>35740748 #>>35745104 #

3. berkut ◴[28 Apr 23 07:12 UTC] No.35738413[source]▶

>>35738345 #

Yeah, it's mildly annoying there's no (at least to my knowledge?) compiler hint like the '__builtin_expect' ones to tell compilers that the values are very unlikely going to be predictable with enough accuracy to allow the branch predictors to be useful in general, and to use a cmov instead of the traditional branching instructions because of this.

replies(2): >>35738478 #>>35739092 #

4. dist1ll ◴[28 Apr 23 07:21 UTC] No.35738478{3}[source]▶

>>35738413 #

clang has __builtin_unpredictable [0]

[0] https://clang.llvm.org/docs/LanguageExtensions.html#builtin-...

replies(2): >>35738484 #>>35747016 #

5. berkut ◴[28 Apr 23 07:22 UTC] No.35738484{4}[source]▶

>>35738478 #

Oooooh :)

Thanks!

6. gpderetta ◴[28 Apr 23 09:11 UTC] No.35739092{3}[source]▶

>>35738413 #

GCC has __builtin_expect_with_probability now. YMMV.

7. josephg ◴[28 Apr 23 09:57 UTC] No.35739365[source]▶

>>35738345 #

I’m always surprised by rust’s performance for this reason. The compiler outputs huge binaries, chock full of bounds checks and the like. But performance doesn’t seem to suffer at all from it. On the contrary - I ported some well optimized C to rust and it ran faster.

I can only assume the compiler is marking all the bounds checks as unlikely to fail, and correctly predicted branches must be more or less free in modern CPUs. It’s the only way I can explain it.

replies(3): >>35739914 #>>35748312 #>>35750685 #

8. eska ◴[28 Apr 23 11:21 UTC] No.35739914{3}[source]▶

>>35739365 #

The compiler will outright remove bounds checks that are implied by previous bounds checks.

It is a common low level optimization to add a wide boundary check in the beginning to avoid successive smaller bounds checks.

replies(1): >>35740998 #

9. xmcqdpt2 ◴[28 Apr 23 13:00 UTC] No.35740748[source]▶

>>35738345 #

If your PGO program ends up in a situation which is the opposite of that you profiled, couldn't you end up with vastly worse performance than a non-optimized program?

I've always felt uncomfortable about PGO for more complex programs because of this.

replies(1): >>35745100 #

10. josephg ◴[28 Apr 23 13:23 UTC] No.35740998{4}[source]▶

>>35739914 #

True. But if binary size is anything to go by, an awful lot of bounds checks still end up in the code.

replies(1): >>35741226 #

11. tomsmeding ◴[28 Apr 23 13:46 UTC] No.35741226{5}[source]▶

>>35740998 #

Doesn't a rust binary also include a whole lot of standard library? C compilers assume that libc is available, a rust compiler is hardly going to assume that rust's libstd is installed on a random user's machine.

replies(1): >>35751925 #

12. zulu-inuoe ◴[28 Apr 23 13:49 UTC] No.35741256[source]▶

>>35738299 (TP) #

404 on that Torvalds link

replies(2): >>35741714 #>>35742100 #

13. pizza234 ◴[28 Apr 23 14:31 UTC] No.35741714[source]▶

>>35741256 #

The HN renderer is rendering the closing round bracket and semicolon as part of the HTML link, which is unfortunate for this case, but good to know :)

14. msla ◴[28 Apr 23 15:00 UTC] No.35742100[source]▶

>>35741256 #

https://yarchive.net/comp/linux/cmov.html

15. msla ◴[28 Apr 23 15:00 UTC] No.35742105[source]▶

>>35738299 (TP) #

https://yarchive.net/comp/linux/cmov.html

16. eklitzke ◴[28 Apr 23 18:17 UTC] No.35745100{3}[source]▶

>>35740748 #

Right, this is why you should use AutoFDO nowadays not PGO. With AutoFDO you occasionally (e.g. in prod this might be something like record for 1s on average every 300s) record what branches were taken using perf-record from prod binaries, and then feed this back to the compiler.

17. eklitzke ◴[28 Apr 23 18:18 UTC] No.35745104[source]▶

>>35738345 #

I commented on this elsewhere as well, but nowadays you should reach first for AutoFDO rather than PGO.

18. usefulcat ◴[28 Apr 23 20:57 UTC] No.35747016{4}[source]▶

>>35738478 #

It does, but unfortunately it doesn't always work. For example (note the -march=broadwell option):

https://godbolt.org/z/n1Kz4G7GE

19. winstonewert ◴[28 Apr 23 23:11 UTC] No.35748312{3}[source]▶

>>35739365 #

Rust's large file sizes are due to relying on static linking, not particularly the extra bounds checking.

20. rocqua ◴[29 Apr 23 06:55 UTC] No.35750685{3}[source]▶

>>35739365 #

I believe that in release mode Rust will remove bounds checks.

replies(1): >>35750975 #

21. josephg ◴[29 Apr 23 07:52 UTC] No.35750975{4}[source]▶

>>35750685 #

No, this is not true. Rust removes integer overflow checks in release mode, but it keeps array bounds checks.

You can see this pretty easily in compiler explorer. This is a simple array lookup compiled in release mode:

https://rust.godbolt.org/z/dhz34KEvj

22. josephg ◴[29 Apr 23 11:22 UTC] No.35751925{6}[source]▶

>>35741226 #

Yeah it does, but that’s not enough to explain how large rust programs tend to get. The equivalent C programs, even statically linked, are usually several times smaller than their equivalents in rust.

And the “compile in the standard library” pattern will probably never be changed in rust due to monomorphization. Lots of types in the standard library need to be specialised based on the types of your structs. Even if rust had an ABI for dynamic linking (it doesn’t), structs and functions which take generic type parameters will probably never be able to be loaded from a shared library. Same as C++.