386 points by ingve | 26 comments
1. 2102922286 ◴[] No.35738100[source]
A cool related algorithm is https://algorithmica.org/en/eytzinger

In addition to being branchless, it also has better cache properties than a standard binary search over a sorted array.

If you're doing a binary search in a sorted array, you start in the middle, then jump all the way to the midpoint of one half, and so on. As a result, there's a lot of distance between successive reads, so each read puts extra strain on the cache (or misses it entirely).

The CPU cache performs much better if the values you want to read are close to each other. Enter the Eytzinger Binary Search. The idea is that, if you're always going to be accessing the root of the tree, and then one of its two children--you should just put those children physically close to the root, so they'll all be on the same cache line!
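
For concreteness, here's a minimal sketch of building that layout (following the algorithmica article; the array names are illustrative). A sorted input a[0..n-1] is written into the 1-indexed b[1..n] by an in-order walk, so the pointers stay implicit and the top of the tree sits in the first cache line:

    #include <vector>

    // Children of b[k] live at b[2*k] and b[2*k+1]; no pointers are stored.
    void build(const std::vector<int>& a, std::vector<int>& b,
               size_t& i, size_t k = 1) {
        if (k < b.size()) {
            build(a, b, i, 2 * k);      // emit the left subtree first (in-order)
            b[k] = a[i++];              // this node takes the next sorted value
            build(a, b, i, 2 * k + 1);  // then the right subtree
        }
    }
    // usage: std::vector<int> b(a.size() + 1); size_t i = 0; build(a, b, i);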

replies(8): >>35738287 #>>35738451 #>>35738561 #>>35739155 #>>35740049 #>>35740397 #>>35740690 #>>35744101 #
2. jeffreygoesto ◴[] No.35738287[source]
Do they publish whether getting the tree/array in shape initially, or inserting an element, differs significantly in speed from the plain sorted layout? The method is a nice read optimization, but you'd need to check whether it amortizes when summed together with the tree creation.
3. srcreigh ◴[] No.35738451[source]
Trying to figure this out myself. So a cache line is 64 bytes. Ignoring pointers, you can fit eight 8-byte data/key values, or sixteen 4-byte data/key values. Pointers are entirely implicit (thx user below)

This would save you 3 or 4 memory reads respectively.

This strategy seems unlikely to help past the first few layers. So for a 100k-element binary tree, this should be a 21% or 31% performance improvement respectively. That's 17 vs 14 and 17 vs 13 reads.

And this would not help much for string data values unless the strings are at most 8 characters long :-)

This same thinking can be used to optimize database queries: given several table and index layouts, how many 8 kB disk reads will be needed in each case? Wish I could get a job doing that kind of design work!

replies(3): >>35738474 #>>35738646 #>>35745053 #
4. atq2119 ◴[] No.35738474[source]
The big advantage of the Eytzinger layout actually comes into play when your binary search isn't branch-free.

In that case, the CPU will speculate loads reasonably far ahead, and because both choices tend to be in the same cacheline, this prefetch is rarely wasted.

In other words, the big win isn't about reducing the number of cache lines fetched, it's about reducing the effective latency of those fetches.

That said, using a similar layout with higher fan-out and SIMD ops doesn't benefit from this effect and is probably still better.

replies(1): >>35738840 #
5. zelphirkalt ◴[] No.35738561[source]
Isn't that like storing a heap inside an array, with indices and values like this:

    0: top of heap,
    1: left of top,
    2: right of top,
    3: left of left of top
    ...

?
replies(1): >>35739255 #
6. vidarh ◴[] No.35738646[source]
You don't store data nodes, or pointers. You store the keys. If you need to store data, just store them at the same index in another array.

The pointers are wholly implicit. E.g., assuming you leave index 0 empty, the left child of node k is at 2k and the right child is at 2k+1.

In terms of benefit here, the main benefit appears to be that since you always know where the children will be, you can trade bandwidth for latency and trigger pre-fetching of the next step down before you've even decided if you're going left or right.
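
To make that concrete, here's a sketch of the prefetch trick from the algorithmica article (assuming the 1-indexed layout above, 4-byte keys, and 64-byte cache lines; the names b and n are illustrative):

    // The 16 descendants four levels below node k sit contiguously at
    // b[16*k .. 16*k+15] -- exactly one cache line -- so a single prefetch
    // covers every possible path four levels down, left or right.
    int search(const int* b, int n, int x) {
        int k = 1;
        while (k <= n) {
            __builtin_prefetch(&b[16 * k]); // assumes b is padded so the address stays valid
            k = 2 * k + (b[k] < x);         // branchless descent: left = 2k, right = 2k+1
        }
        return k >> __builtin_ffs(~k);      // backtrack to the answer (see comment 11 below)
    }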

replies(3): >>35738793 #>>35738949 #>>35750668 #
7. srcreigh ◴[] No.35738793{3}[source]
That’s what I meant re keys, but ya nice point about pointers. Forgetting details of my data structures classes. updated my post (used to count 1 bit per node for pointers)
8. srcreigh ◴[] No.35738840{3}[source]
I think effective latency and reducing the number of cache lines are probably the exact same thing. The point is that instead of fetching 5 cache lines to get to the 5th layer of the tree from memory, you fetch just 2 and access the mega root node via cache.
9. WithinReason ◴[] No.35738949{3}[source]
So binary search should actually not be binary but... octary?
replies(2): >>35739068 #>>35740286 #
10. vidarh ◴[] No.35739068{4}[source]
I wondered about this too; there's a tradeoff between the cost of computing which branch to take vs. how much that improves locality.

But it seems they did as well. See the graph at the bottom, which compares against an (Eytzinger-style) B-tree, which is basically taking that approach. You do more comparisons per node to figure out which branch to take, but fetch less memory.

Given the overall performance they show is very similar, the question if you want to scale this basically becomes whether your system saturates available cores or available memory bandwidth first.

(Would have been interesting to see comparisons against quaternary and octal versions as well, though their B-tree test seems to rely on 16 branches, so maybe the sweet spot is somewhere in between.)

11. srcreigh ◴[] No.35739155[source]
Pretty cool stuff. The algorithm returns the smallest item that's at least as big as x. At each step it checks array[k] and then checks either 2k or 2k+1 next.

The last k will be past the end of the array so you have to backtrack to discover the actual lower bound.

k records its turns in binary, such as 10111. After another right turn it's 101111. So to backtrack, you strip off the trailing 1s plus the final 0 (the last left turn).

How to do that though? There's a find-first-set primitive, ffs (backed by the bsf/tzcnt instructions on x86), that gives you the 1-based position of the lowest set bit. So invert k to get 010000; ffs gives you 5. So for k = 101111, shift right by 5 and you get 1. That cancels the right turns and the final left turn, and now you have the index of the lower bound.
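
Putting that together, a minimal compilable sketch (essentially the version from the algorithmica article; __builtin_ffs is GCC/Clang's 1-based find-first-set):

    #include <vector>

    // b is the 1-indexed Eytzinger array. Returns the index of the
    // smallest element >= x, or 0 if every element is smaller than x.
    int lower_bound(const std::vector<int>& b, int x) {
        int n = (int)b.size() - 1;
        int k = 1;
        while (k <= n)
            k = 2 * k + (b[k] < x);  // record the path: 0 = left turn, 1 = right turn
        k >>= __builtin_ffs(~k);     // shift out the trailing 1s plus the final 0
        return k;
    }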

replies(3): >>35739267 #>>35740458 #>>35740555 #
12. contravariant ◴[] No.35739255[source]
It is, but so is storing a sorted array.
replies(1): >>35739531 #
13. contravariant ◴[] No.35739267[source]
Don't you just need to remove the trailing 1s? In that case it's just `x & (x+1)`
replies(1): >>35740636 #
14. account42 ◴[] No.35739531{3}[source]
It's still misleading to call this a binary search, since that's commonly understood as an algorithm operating on a sorted range, which is not just an arbitrary data structure but something that already exists in many situations. If you just need fast lower_bound lookups for a set that you control and which has no other uses, then sure, this is nice. But more commonly you only care whether an element is in the set, and in that case, if you can control the data structure, there are many more options.
replies(1): >>35739888 #
15. eska ◴[] No.35739888{4}[source]
I agree with you.

Algorithms don't exist in a bubble, but are always accompanied by a data structure.

A heap-ordered array is different from a linearly sorted array. Binary search is only defined for the latter.

16. mihaic ◴[] No.35740049[source]
Thinking about cache-optimality, I'm wondering if anyone is using a hybrid: the first levels are laid out this way (index K branches to 2K and 2K+1 if 1-based), but the end levels are consecutive so they can be loaded in a single cache line anyway.
17. martincmartin ◴[] No.35740286{4}[source]
Yes, these are the same issues faced by databases on spinning rust disks. The solution was to figure out how many keys fit in a disk block / cache line, let's call it k, then have a k-ary tree instead of a binary tree. This is called a B-tree.

https://en.wikipedia.org/wiki/B-tree

With caching on modern CPUs, as opposed to databases on disks, there are multiple levels of caching and you may not know the size of the caches. And the caches can be shared with other work you're doing, or with work other processes are doing, so the effective size can be smaller. So there's some benefit to algorithms that are independent of cache size (cache-oblivious algorithms).

replies(1): >>35742000 #
18. blondin ◴[] No.35740397[source]
are you going to have a printed version of your book?
19. ◴[] No.35740458[source]
20. srcreigh ◴[] No.35740555[source]
typo: smallest number at least as big as x
21. srcreigh ◴[] No.35740636{3}[source]
No, that would just set trailing 1s to 0. We want to trim them.

The idea behind trimming the trailing 1s is that they represent a subtree whose values are all smaller than x. By trimming those 1s (plus the final 0, as above), we traverse back up the tree to the smallest number that is larger than or equal to x (aka the answer, the lower bound).
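
Concretely (a tiny check; __builtin_ffs is GCC/Clang's 1-based find-first-set, standing in for the ffs mentioned above):

    #include <cstdio>

    int main() {
        int k = 0b101111;                       // 47: root, one left turn, four right turns
        printf("%d\n", k & (k + 1));            // 32 = 0b100000: trailing 1s zeroed in place, a deeper (wrong) index
        printf("%d\n", k >> __builtin_ffs(~k)); // 1: trailing 1s plus the final 0 shifted out, the answer's index
    }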

22. utopcell ◴[] No.35740690[source]
Eytzinger layouts are great, but they require more space if one is not allowed to touch the original sorted array. Then again, if you are allowed to use extra space, lower_bound operations (aka binary searching) can be implemented in worst-case constant time for integer arrays.
23. vidarh ◴[] No.35742000{5}[source]
But note that, as per their benchmark against B-trees, in memory the cost of the memory fetch is sufficiently small that the cost of the comparisons might wipe out the savings.

When retrieving from disk you have far more cycles to work with, so blocks smaller than a full disk read can't get close to competitive.

24. vlovich123 ◴[] No.35744101[source]
IIRC it really only works well if you have a rarely mutated structure that's read-heavy. Otherwise every time you try to mutate the array, you have to rebuild the layout & it's typically heavier than just sorting (i.e. you have to sort + regenerate the layout). It's a neat concept for sure.
25. srcreigh ◴[] No.35745053[source]
typo: mixed up bit and byte when referring to the different integer sizes
26. rocqua ◴[] No.35750668{3}[source]
How does it help pre-fetching?