
386 points by ingve | 3 comments
pizza234
Some information on CMOV can be found in the Intel Optimization Reference Manual (https://cdrdv2-public.intel.com/671488/248966-046A-software-...).

Torvalds was famously critical of it (https://yarchive.net/comp/linux/cmov.html); part of the criticism is now moot, though, due to CMOV's low latency on modern processors (apparently one cycle or less, although the instruction does consume internal flags).

His idea still seems applicable and consistent with Intel's recommendation: first make branches predictable, and only then use CMOV for the remaining ones. His fundamental assumption is that "even if you were to know that something is unpredictable, it's going to be very rare."
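As a concrete illustration (a hand-written sketch, not taken from the manual or from Torvalds' post), the kind of select that a compiler can lower either to a conditional jump or to a CMOV looks like this; with a real branch, performance hinges entirely on how predictable the data is:

    // Sketch: clamp negative values to zero. Both sides of the select are
    // cheap and already computed, so an optimizing compiler is free to emit
    // a CMOV (or a vector max) instead of a conditional jump.
    fn clamp_negatives(values: &mut [i64]) {
        for v in values.iter_mut() {
            *v = if *v < 0 { 0 } else { *v };
        }
    }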

2102922286
This is one area where Profile-Guided Optimization (PGO) can help a lot! With PGO, you run your program on some sample input and it logs information such as how many times each side of a branch was taken. From there, you recompile your code. If the compiler sees that one side of a branch dominates, it can emit code that prioritizes that side. However, if the branch counts are approximately even and the branch is hard to predict (n.b. this is technically distinct from having even branch counts), then the compiler knows the CPU would have trouble predicting it and can emit a cmov instruction instead.
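To make the "even counts vs. unpredictable" distinction concrete, here is a made-up sketch (in Rust, where PGO happens to be driven by rustc's -Cprofile-generate and -Cprofile-use flags): both loops take each side of their branch about half the time, but only the second one actually defeats the predictor.

    // 50/50 taken counts, but the pattern T,N,T,N,... is trivially
    // predictable, so a plain branch is fine here.
    fn count_even(n: usize) -> usize {
        let mut acc = 0;
        for i in 0..n {
            if i % 2 == 0 {
                acc += 1;
            }
        }
        acc
    }

    // Counts may also be ~50/50, but if `data` is effectively random the
    // predictor is useless; this is the case where profile data can tell
    // the compiler to prefer a cmov-style, branchless select.
    fn count_above(data: &[u32], threshold: u32) -> usize {
        let mut acc = 0;
        for &x in data {
            if x > threshold {
                acc += 1;
            }
        }
        acc
    }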
josephg
I’m always surprised by Rust’s performance for this reason. The compiler outputs huge binaries, chock full of bounds checks and the like. But performance doesn’t seem to suffer from them at all. On the contrary: I ported some well-optimized C to Rust and it ran faster.

I can only assume the compiler is marking all the bounds checks as unlikely to fail, and correctly predicted branches must be more or less free in modern CPUs. It’s the only way I can explain it.
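For what it's worth, that intuition matches how a checked index roughly shakes out; a hand-rolled equivalent (not the actual standard-library internals) looks something like this, with the failure path kept out of line and marked cold:

    // Roughly what xs[i] costs: a compare, a branch that essentially never
    // goes the "wrong" way in correct code, and a cold, never-inlined
    // function for the failure case.
    fn get_checked(xs: &[u32], i: usize) -> u32 {
        if i < xs.len() {
            xs[i] // hot path; the now-redundant check here is typically elided
        } else {
            index_failed(i, xs.len()) // cold path
        }
    }

    #[cold]
    #[inline(never)]
    fn index_failed(index: usize, len: usize) -> ! {
        panic!("index out of bounds: the len is {len} but the index is {index}")
    }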

eska
The compiler will outright remove bounds checks that are implied by previous bounds checks.

It is also a common low-level optimization to add one wide bounds check at the beginning to avoid successive smaller bounds checks.
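A sketch of that pattern (the function is made up for illustration): one slice operation up front performs a single check, after which the per-iteration checks are provably redundant and the optimizer can drop them.

    // One wide check, then indexing the compiler can prove safe.
    fn sum_first_n(xs: &[u64], n: usize) -> u64 {
        let head = &xs[..n]; // single bounds check: panics here if n > xs.len()
        let mut total = 0;
        for i in 0..n {
            // head.len() == n, so head[i] with i < n is always in bounds;
            // the per-element check can be optimized away.
            total += head[i];
        }
        total
    }

(Idiomatic Rust would just write head.iter().sum(), which avoids the indexed checks altogether, but the indexed form shows the elision being described.)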

josephg
True. But if binary size is anything to go by, an awful lot of bounds checks still end up in the code.
tomsmeding
Doesn't a Rust binary also include a whole lot of the standard library? C compilers assume that libc is available; a Rust compiler is hardly going to assume that Rust's libstd is installed on a random user's machine.
josephg
Yeah it does, but that’s not enough to explain how large Rust programs tend to get. Equivalent C programs, even statically linked, are usually several times smaller than their Rust counterparts.

And the “compile in the standard library” pattern will probably never change in Rust, due to monomorphization. Lots of types in the standard library need to be specialised for the concrete types they are used with. Even if Rust had a stable ABI for dynamic linking (it doesn’t), structs and functions that take generic type parameters will probably never be loadable from a shared library. Same as C++.
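A toy example of why that is (names made up): one generic function in the source becomes a separate machine-code instantiation per concrete type, which is exactly what a pre-compiled shared library cannot provide for types it has never seen.

    // One generic function in source ...
    fn largest<T: Ord>(items: &[T]) -> Option<&T> {
        items.iter().max()
    }

    fn main() {
        // ... but two separate compiled copies: largest::<i32> and largest::<&str>.
        let a = largest(&[3_i32, 1, 4]);
        let b = largest(&["a", "zz", "m"]);
        println!("{:?} {:?}", a, b);
    }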