> In fact in Clang branchless_lower_bound is slower than std::lower_bound
That's strange. I profiled it, and branchless_lower_bound is still faster than std::lower_bound with clang 14, just not as fast as it is with gcc 12 (on Intel Broadwell). I'm using gcc's libstdc++ in both cases; maybe he was using libc++ with clang?
Edit:
Replacing the contents of the for loop with the following improves performance for clang but reduces performance for gcc:
const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];