Beautiful branchless binary search

This is a great article I can't wait to dig into (and the comments here) as soon as I can unshift my brain from the gear it's in.

One thing I noticed is the author doesn't compare against linear search for reference. I am personally very curious at what container size either lower_bound or branchless_lower_bound start to outperform a linear scan on modern hardware with modern L1 cache sizes etc.

In the thing I've been playing with -- very unscientifically with a vector of up to 16 sorted u8s:

On x86_64, an SSE optimized vector scan like this: https://github.com/armon/libart/blob/master/src/art.c#L426 is slightly faster than linear scan, which is in turn slighty faster than binary search (the latter two are very close)

However on M1 Mac, simple binary search outperforms a NEON SIMD optimized search which in turn is basically tied with linear scan. Sometimes. The NEON algorithm is trickier than the SSE because NEON lacks an equivalent of SSE's _mm_movemask_epi8