2102922286 No.35738100
A cool related algorithm is https://algorithmica.org/en/eytzinger

In addition to being branchless, it has better cache properties than a standard binary search over a sorted array.

If you're doing a binary search in a sorted array, you start in the middle, then jump all the way to the midpoint of one half, and so on. As a result, there's a lot of distance between successive reads, so almost every read lands on a new cache line and misses.

The CPU cache performs much better if the values you want to read are close to each other. Enter the Eytzinger Binary Search. The idea is that, if you're always going to be accessing the root of the tree, and then one of its two children--you should just put those children physically close to the root, so they'll all be on the same cache line!
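
A minimal sketch of both halves in C++, closely following the algorithmica article (the index-recovery trick is theirs; __builtin_ffs is a GCC/Clang builtin, and the names here are mine):

    #include <cstdio>
    #include <vector>

    std::vector<int> t;  // t[1..n] holds the keys in BFS (Eytzinger) order; t[0] is unused
    int n;

    // An in-order walk of the implicit tree visits slots in sorted order,
    // so filling them from a sorted array builds the layout.
    void build(const std::vector<int>& sorted, int& i, int k = 1) {
        if (k <= n) {
            build(sorted, i, 2 * k);       // left subtree
            t[k] = sorted[i++];
            build(sorted, i, 2 * k + 1);   // right subtree
        }
    }

    // Branchless lower_bound: index of the first key >= x, or 0 if none.
    int lower_bound(int x) {
        int k = 1;
        while (k <= n)
            k = 2 * k + (t[k] < x);        // children of node k are 2k and 2k+1
        k >>= __builtin_ffs(~k);           // cancel trailing right-turns plus one left-turn
        return k;
    }

    int main() {
        std::vector<int> sorted = {1, 3, 5, 7, 9, 11, 13};
        n = (int)sorted.size();
        t.assign(n + 1, 0);
        int i = 0;
        build(sorted, i);
        int k = lower_bound(6);
        std::printf("%d\n", k ? t[k] : -1);  // prints 7
    }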

srcreigh No.35738451
Trying to figure this out myself. So a cache line is 64 bytes. You can fit eight 8-byte data/key values, or sixteen 4-byte data/key values; the pointers are entirely implicit (thx user below).

This would save you 2 or 3 memory reads respectively, since the first cache-line fetch covers the top 3 or 4 levels of the tree.

The strategy only helps in the first few layers, where nodes are still packed densely enough to share a line. So for a 100k-element binary tree, that's 17 vs 15 and 17 vs 14 reads: roughly a 12% or 18% reduction respectively.

And this would not help much for string keys unless the strings are at most 8 characters long :-)

This same thinking can be used to optimize database queries: given several table and index layouts, estimate how many 8 KB disk reads each one would need. Wish I could get a job doing that kind of design work!
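
A quick sanity check of that arithmetic in C++ (toy back-of-envelope numbers, nothing measured):

    #include <cmath>
    #include <cstdio>

    // With B keys per 64-byte cache line, the top floor(log2(B + 1))
    // levels of the tree all share the very first line.
    int main() {
        int depth = (int)std::ceil(std::log2(100000.0));      // ~17 comparisons
        for (int key_bytes : {8, 4}) {
            int per_line = 64 / key_bytes;                    // 8 or 16 keys per line
            int shared = (int)std::log2(per_line + 1.0);      // 3 or 4 levels in line one
            int fetches = depth - (shared - 1);               // one fetch covers them all
            std::printf("%d-byte keys: %d vs %d line fetches\n",
                        key_bytes, fetches, depth);           // 15 vs 17, then 14 vs 17
        }
    }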

atq2119 No.35738474
The big advantage of the Eytzinger layout actually comes into play when your binary search isn't branch-free.

In that case, the CPU will speculate loads reasonably far ahead, and because both possible children tend to sit in the same cache line, the speculative fetch is rarely wasted.

In other words, the big win isn't about reducing the number of cache lines fetched; it's about reducing the effective latency of those fetches.

That said, a similar layout with higher fan-out and SIMD comparisons doesn't benefit from this effect, and is probably still faster overall.
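
For the branchless version, the algorithmica article recovers the same latency hiding with an explicit prefetch instead. A rough sketch (GCC/Clang builtins; the alignment and padding assumptions are in the comments):

    #include <vector>

    // Assumes t is 64-byte aligned and over-allocated past index 16*k
    // so the prefetch address stays inside the buffer. Since the
    // branchless loop gives the hardware nothing to speculate on, we
    // prefetch four levels ahead ourselves: with sixteen 4-byte keys
    // per line, nodes 16k..16k+15 -- the great-great-grandchildren of
    // node k -- share a single cache line.
    int lower_bound_prefetch(const std::vector<int>& t, int n, int x) {
        int k = 1;
        while (k <= n) {
            __builtin_prefetch(t.data() + 16 * k);  // fetch 4 levels ahead
            k = 2 * k + (t[k] < x);
        }
        k >>= __builtin_ffs(~k);
        return k;  // 0 means every key is < x
    }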

srcreigh No.35738840
I think effective latency and the number of cache lines fetched are probably the exact same thing here. The point is that instead of fetching 5 cache lines from memory to get to the 5th layer of the tree, you fetch just 2: one for the packed "mega root" block at the top, and one for the node below it.
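
A toy count of the distinct cache lines touched on one search path, under the same assumptions as above (sixteen 4-byte keys per 64-byte line, line-aligned array):

    #include <cstdio>
    #include <set>

    int main() {
        std::set<int> lines;
        int k = 1;                              // root of the implicit tree
        for (int level = 0; level < 5; level++) {
            lines.insert(k / 16);               // which cache line node k lives in
            k = 2 * k;                          // descend (take the left child, say)
        }
        std::printf("%zu distinct lines\n", lines.size());  // prints 2, not 5
    }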