
386 points by ingve | 4 comments
2102922286 ◴[] No.35738100[source]
A cool related algorithm is https://algorithmica.org/en/eytzinger

In addition to being branchless, it also has better cache properties than a standard binary search tree.

If you're doing a binary search in an array, you start in the middle, and then jump all the way to the midway point of one half, and so on. As a result, there's a lot of distance between each read that you do in the array. Thus, each read is putting extra strain on the cache (or the reads won't be cached).

The CPU cache performs much better if the values you want to read are close to each other. Enter the Eytzinger Binary Search. The idea is that, if you're always going to be accessing the root of the tree, and then one of its two children--you should just put those children physically close to the root, so they'll all be on the same cache line!
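
For the curious, a minimal sketch of how such a layout can be built from a sorted array (the function names and the recursive construction are just my illustration, not code from the linked article):

    #include <cstddef>
    #include <vector>

    // Copy a sorted array into Eytzinger (breadth-first) order.
    // Slot 0 is left unused so that node k's children land at 2k and 2k+1.
    void fill(const std::vector<int>& sorted, std::vector<int>& tree,
              std::size_t& next, std::size_t k) {
        if (k < tree.size()) {
            fill(sorted, tree, next, 2 * k);       // left subtree gets the smaller keys
            tree[k] = sorted[next++];              // this node takes the next key in order
            fill(sorted, tree, next, 2 * k + 1);   // right subtree gets the larger keys
        }
    }

    std::vector<int> eytzinger(const std::vector<int>& sorted) {
        std::vector<int> tree(sorted.size() + 1);  // +1 for the unused slot 0
        std::size_t next = 0;
        fill(sorted, tree, next, 1);
        return tree;
    }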

replies(8): >>35738287 #>>35738451 #>>35738561 #>>35739155 #>>35740049 #>>35740397 #>>35740690 #>>35744101 #
srcreigh ◴[] No.35738451[source]
Trying to figure this out myself. So a cache line is 64 bytes. Ignoring pointers, you can fit eight 8-byte data/key values, or sixteen 4-byte data/key values. Pointers are entirely implicit (thx user below)

This would save you 3 or 4 memory reads respectively.

This strategy seems unlikely to help much past the first few layers. So for a 100k-element binary tree (about 17 levels, since 2^17 ≈ 131k), this should be a 21% or 31% speedup respectively. That's 17 vs 14 and 17 vs 13 reads.

And this would not help much for string data values unless the strings are at most 8 characters long :-)

This same thinking can be used to optimize database queries: given several table and index layouts, how many 8 kB disk reads will be needed in each case? Wish I could get a job doing that kind of design work!

replies(3): >>35738474 #>>35738646 #>>35745053 #
vidarh ◴[] No.35738646[source]
You don't store data nodes, or pointers. You store the keys. If you need to store data, just store them at the same index in another array.

The pointers are wholly implicit. E.g. assuming you leave index 0 empty, the left child of node k is at index 2k and the right child is at 2k+1.

In terms of benefit here, the main benefit appears to be that since you always know where the children will be, you can trade bandwidth for latency and trigger pre-fetching of the next step down before you've even decided if you're going left or right.
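
That prefetching trick is roughly the shape of the search loop on the linked algorithmica page (my paraphrase; __builtin_prefetch is the GCC/Clang intrinsic, and the array/variable names here are mine):

    // Lower-bound search over an Eytzinger-ordered array a[1..n] (slot 0 unused).
    // Because the children of node k are always at 2k and 2k+1, we can prefetch
    // the cache line holding k's descendants four levels down before the
    // comparison result is even known (prefetching past the end is harmless,
    // it's only a hint).
    int eytzinger_search(const int* a, int n, int x) {
        int k = 1;
        while (k <= n) {
            __builtin_prefetch(a + 16 * k);   // 16 ints = 64 bytes = 4 levels ahead
            k = 2 * k + (a[k] < x);           // branchless: step right iff a[k] < x
        }
        k >>= __builtin_ffs(~k);              // undo the trailing "right, then lefts"
        return k;                             // index of first element >= x, or 0 if none
    }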

replies(3): >>35738793 #>>35738949 #>>35750668 #
1. WithinReason ◴[] No.35738949{3}[source]
So binary search should actually not be binary but... octary?
replies(2): >>35739068 #>>35740286 #
2. vidarh ◴[] No.35739068[source]
I wondered about this too; there's a tradeoff there between the cost of computing which branch to take vs. how much that improves locality.

But it seems they did as well. See the graph at the bottom, which compares against an (Eytzinger-style) B-tree, which is basically taking that approach: you do more comparisons per node to figure out which branch to take, but fetch less memory.

Given that the overall performance they show is very similar, the question, if you want to scale this, basically becomes whether your system saturates available cores or available memory bandwidth first.

(Would have been interesting to see comparisons with quaternary and octonary versions as well, though their B-tree test seems to use 16 branches, and maybe the sweet spot is somewhere in between.)

3. martincmartin ◴[] No.35740286[source]
Yes, these are the same issues faced by databases on spinning rust disks. The solution was to figure out how many keys fit in a disk block / cache line, let's call it k, then have a k-ary tree instead of a binary tree. This is called a B-tree.

https://en.wikipedia.org/wiki/B-tree
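
A toy sketch of the idea (illustrative only; KEYS_PER_NODE and choose_child are made-up names, and a real B-tree node also holds child pointers or page offsets): derive the fan-out from the block size, then scan one node's keys to pick the child to descend into.

    #include <cstddef>

    constexpr std::size_t BLOCK_BYTES   = 64;                         // cache line; ~8 KB for a disk page
    constexpr std::size_t KEYS_PER_NODE = BLOCK_BYTES / sizeof(int);  // 16 keys per node here

    // Returns which of the KEYS_PER_NODE + 1 children to follow for key x.
    std::size_t choose_child(const int (&keys)[KEYS_PER_NODE], int x) {
        std::size_t i = 0;
        while (i < KEYS_PER_NODE && keys[i] < x)   // several comparisons per level,
            ++i;                                   // but only one block fetched per level
        return i;
    }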

With caching on modern CPUs, as opposed to databases on disks, there are multiple levels of caching and you may not know the size of the caches. And the caches can be shared with other work you're doing, or with other processes, so the effective size can be smaller. So there's some benefit to algorithms that are independent of cache size.

replies(1): >>35742000 #
4. vidarh ◴[] No.35742000[source]
But note that, as per their benchmark against B-trees, in memory the cost of a memory fetch is small enough that the cost of the extra comparisons might wipe out the savings.

When retrieving from disk, you have far more cycles to work with per fetch, so blocks smaller than a disk read can't come close to being competitive.