Branchless or not in this case, it still touches memory in not so good pattern. I found that a significant speedup of a classic BS could be achieved by switching to vectorized search when the remaining range has a width of 3-4 SIMD lines (or maybe even a little more). The bounds of that range are likely already touched and in cache, then prefetching helps. It has high likelihood of finding the target on a single SIMD compare, then switch to linear search on small data that is already in L1. It gives 30-50% gain on 1K items array of integers, 10-25% on 1M items, depending on data distribution. Here is an example in C#: https://github.com/Spreads/Spreads/blob/main/src/Spreads.Cor...
replies(1):