One thing I noticed is that the author doesn't compare against linear search for reference. I'm personally very curious at what container size lower_bound or branchless_lower_bound starts to outperform a linear scan on modern hardware with modern L1 cache sizes etc.
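For concreteness, here is roughly what the three scalar candidates look like for a small sorted array of bytes. This is only a sketch: the function names are mine, and the branchless variant is a generic textbook form, not necessarily the article's branchless_lower_bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// All three return the index of the first element >= key, or n if there is none.
// Precondition: a[0..n) is sorted ascending.

// Plain linear scan: stop at the first element that is not less than key.
std::size_t linear_lower_bound(const std::uint8_t* a, std::size_t n, std::uint8_t key) {
    std::size_t i = 0;
    while (i < n && a[i] < key) ++i;
    return i;
}

// Thin wrapper around the standard library's binary search.
std::size_t std_lower_bound(const std::uint8_t* a, std::size_t n, std::uint8_t key) {
    return static_cast<std::size_t>(std::lower_bound(a, a + n, key) - a);
}

// Hypothetical branchless-style binary search (illustrative name).
// The loop count depends only on n, and the conditional advance is the kind
// of expression compilers often lower to a conditional move (not guaranteed).
std::size_t branchless_lower_bound_sketch(const std::uint8_t* a, std::size_t n, std::uint8_t key) {
    // Debug-only check; compiled out with NDEBUG, so it does not affect release timings.
    assert(std::is_sorted(a, a + n));
    const std::uint8_t* base = a;
    std::size_t len = n;
    while (len > 1) {
        const std::size_t half = len / 2;
        base += (base[half - 1] < key) ? half : 0;
        len -= half;
    }
    // One element may remain undecided; step past it if it is still too small.
    if (len == 1 && *base < key) ++base;
    return static_cast<std::size_t>(base - a);
}
```

Any timing comparison would need to call these in a loop with unpredictable keys, otherwise the branch predictor makes the linear scan look better than it is.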
In the thing I've been playing with -- very unscientifically, with a vector of up to 16 sorted u8s:
On x86_64, an SSE-optimized vector scan like this: https://github.com/armon/libart/blob/master/src/art.c#L426 is slightly faster than a linear scan, which is in turn slightly faster than binary search (the latter two are very close).
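The SSE approach boils down to something like the sketch below. It is written in the spirit of the linked code rather than copied from it; the function name is made up, it assumes the key buffer is always 16 readable bytes (padded past n), and __builtin_ctz assumes GCC or Clang. On targets without SSE2 it falls back to a plain loop.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: index of `key` among the first n (<= 16) bytes of
// `keys`, or -1 if absent. `keys` must point at 16 readable bytes.
int find_key_sse(const std::uint8_t keys[16], int n, std::uint8_t key) {
    assert(n >= 0 && n <= 16);
#if defined(__SSE2__)
    // Broadcast the key into all 16 lanes and compare every lane at once.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(key));
    const __m128i hay = _mm_loadu_si128(reinterpret_cast<const __m128i*>(keys));
    const __m128i eq = _mm_cmpeq_epi8(needle, hay);
    // movemask packs the top bit of each lane into a 16-bit integer;
    // mask off the lanes at index n and above, which are only padding.
    const unsigned mask =
        static_cast<unsigned>(_mm_movemask_epi8(eq)) & ((1u << n) - 1u);
    return mask ? __builtin_ctz(mask) : -1;
#else
    // Scalar fallback for non-SSE2 targets.
    for (int i = 0; i < n; ++i) {
        if (keys[i] == key) return i;
    }
    return -1;
#endif
}
```

Note that this is an equality search, so it does not even need the keys to be sorted, which is part of why it is hard to compare fairly against lower_bound.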
However, on an M1 Mac, simple binary search outperforms a NEON-optimized SIMD search, which in turn is basically tied with a linear scan. Sometimes. The NEON algorithm is trickier than the SSE one because NEON lacks an equivalent of SSE's _mm_movemask_epi8.
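One commonly used workaround for the missing movemask is a narrowing shift that squeezes the 128-bit compare result into a 64-bit value with 4 bits per lane. The sketch below shows that trick; it is not necessarily what I benchmarked. The function name is made up, it assumes AArch64, GCC or Clang for __builtin_ctzll, and a key buffer of 16 readable bytes, and it falls back to a plain loop elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: index of `key` among the first n (<= 16) bytes of
// `keys`, or -1 if absent. `keys` must point at 16 readable bytes.
int find_key_neon(const std::uint8_t keys[16], int n, std::uint8_t key) {
    assert(n >= 0 && n <= 16);
#if defined(__aarch64__)
    // Each lane becomes 0xFF on a match and 0x00 otherwise.
    const uint8x16_t eq = vceqq_u8(vld1q_u8(keys), vdupq_n_u8(key));
    // Treat the result as 8 x u16, shift right by 4 and narrow to 8 x u8:
    // every original byte lane is now represented by one nibble.
    const uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
    std::uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
    // Drop the nibbles for padding lanes (a shift by 64 would be undefined).
    if (n < 16) mask &= (std::uint64_t{1} << (4 * n)) - 1;
    // Four mask bits per lane, so divide the bit index by 4.
    return mask ? __builtin_ctzll(mask) / 4 : -1;
#else
    // Scalar fallback for non-AArch64 targets.
    for (int i = 0; i < n; ++i) {
        if (keys[i] == key) return i;
    }
    return -1;
#endif
}
```

That is three extra steps (reinterpret, narrowing shift, lane extract) where SSE needs a single instruction, which may be part of why the NEON version struggles to beat scalar code at this size.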