pkhuong ◴[] No.42204012[source]

    auto length = last - first;
    while (length > 0) {
       auto half = length / 2;
       if (comp(first[half], value)) {
          // length - half equals half + length % 2
          first += length - half;
       }
       length = half;
    }
    return first;
can be restructured so that the conditional update is `first += half` and the length update is an unconditional `length -= half` (see, e.g., Listing 2 in https://arxiv.org/abs/1509.05053). Doing it this way requires one final comparison when `length > 0` on entry, because the loop condition becomes `length > 1`, but it moves the subtraction off the critical path.
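Roughly, a sketch of what I mean (my paraphrase, not the exact Listing 2 code; same `first`, `last`, `value`, and `comp` as above):

    auto length = last - first;
    while (length > 1) {
       auto half = length / 2;
       if (comp(first[half], value))
          first += half;        // the only data-dependent update
       length -= half;          // unconditional, off the critical path
    }
    if (length > 0 && comp(*first, value))
       ++first;                 // the one extra comparison mentioned above
    return first;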
replies(1): >>42204753 #
ryao ◴[] No.42204753[source]

  _Pragma("GCC unroll 9")
  while (nelems > 1) {
    auto half = nelems / 2;
    nelems -= half;
    // comp() yields 0 or 1, so the multiply makes the pointer update branch-free.
    first += comp(first[half - 1], value) * half;
  }
  return first;
As long as we are sharing improvements, here is mine. Microbenchmarking numerous variations showed that this one performed best. If your array is smaller than the unrolling factor, you do not even loop: in that case you have exactly one branch, executed exactly once. This is as close to branch-free as a binary search can get.

I devised this a couple years ago for use in ZFS’ btree code:

https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...

Feel free to have it under your choice of OSI-approved license, including CC0.
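For reference, here is one way to wrap the loop above as a self-contained function (the name, template signature, and comments are mine, not from the ZFS code, which is C; it narrows the range to a single candidate element and leaves any equality check to the caller):

  template <typename It, typename T, typename Comp>
  It unrolled_search(It first, It last, const T& value, Comp comp) {
      auto nelems = last - first;
  #pragma GCC unroll 9
      while (nelems > 1) {
          auto half = nelems / 2;
          nelems -= half;
          // comp() is 0 or 1, so the multiply advances first by half elements or not at all.
          first += comp(first[half - 1], value) * half;
      }
      return first;
  }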

replies(1): >>42205569 #
pkhuong ◴[] No.42205569[source]
Yeah, that's the same structure. For simple comparisons, I would expect the pointer update to become a conditional move, so there are zero branches apart from the data-independent loop branch.
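For example, with an integer key the data-dependent step can be written either way below, and GCC/Clang typically compile both without a branch (a sketch using the names from the snippet above):

    first += comp(first[half - 1], value) * half;                   // 0/1 times half: pure arithmetic
    first = comp(first[half - 1], value) ? first + half : first;    // usually a cmov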
replies(1): >>42206334 #
ryao ◴[] No.42206334[source]
My recollection is that the branch predictor should mispredict the last branch of the loop. Unrolling eliminates that misprediction by replacing all of the loop iteration branches with a jump table branch that it will predict correctly more often. This is micro-optimizing to an extreme. I vaguely recall seeing a slight improvement in a microbenchmark that I attributed to this effect.