Beautiful branchless binary search

(probablydance.com)

386 points ingve | 3 comments | 28 Apr 23 05:46 UTC | HN request time: 0.744s | source

Show context

buybackoff ◴[28 Apr 23 08:11 UTC] No.35738775[source]▶

Branchless or not in this case, it still touches memory in not so good pattern. I found that a significant speedup of a classic BS could be achieved by switching to vectorized search when the remaining range has a width of 3-4 SIMD lines (or maybe even a little more). The bounds of that range are likely already touched and in cache, then prefetching helps. It has high likelihood of finding the target on a single SIMD compare, then switch to linear search on small data that is already in L1. It gives 30-50% gain on 1K items array of integers, 10-25% on 1M items, depending on data distribution. Here is an example in C#: https://github.com/Spreads/Spreads/blob/main/src/Spreads.Cor...

replies(1): >>35740681 #

1. slashdev ◴[28 Apr 23 12:52 UTC] No.35740681[source]▶

>>35738775 #

I’d be surprised if SIMD + branch to switch algorithms beats branchless binary search for data in cache. Were you comparing with that? Not saying it isn’t possible, god knows CPUs surprise in all kinds of wonderful ways. It’s not the outcome I would have guessed though.

4x AVX2 vectors of 32bit integers is 32 integers. So log2(32) = 5, 5 cmov instructions or a largely unpredictable branch + simd instructions and either more branches or more cmovs to get the result.

It's possible if you were testing on similar sized arrays in a loop that the unpredictable branch becomes 100% predictable, which I think would give the result you saw.

replies(1): >>35745036 #

2. eklitzke ◴[28 Apr 23 18:13 UTC] No.35745036[source]▶

>>35740681 (TP) #

I haven't tested this either, but why would the branch be "largely unpredictable"? It's always going to evaluate to false except once, and the one time it evaluates to true the rest of the search will be finished within the branch.

replies(1): >>35748829 #

3. slashdev ◴[29 Apr 23 00:36 UTC] No.35748829[source]▶

>>35745036 #

I’m referring to exactly that misprediction when the branch triggers. It’s only predictable if the input arrays always has the same length log2, and it’s called in a loop so the history is there.

↑