386 points ingve | 2 comments

kookamamie ◴[] No.35738041[source]
Trusting the C++ compiler to emit a specific instruction for guaranteed performance - cute, but not realistic.
replies(1): >>35738060 #
vkazanov ◴[] No.35738060[source]
I don't think it is the instruction that makes the difference. Cmov is a branch either way.

Branches can be unpredictable and slow, no matter what instructions (or tables, or whatever) are used to introduce the branch in question.

It is the predictable nature of branching in this algo that makes the difference. Which makes me wonder...

EDIT: the cmov vs explicit branching story is a bit more complicated than just branch vs no branch.

replies(1): >>35738115 #
2102922286 ◴[] No.35738115[source]
Cmov doesn't branch. A branch refers specifically to the program counter ending up in more than one possible place after an instruction has executed. It is this behavior that mucks with the CPU state and slows everything down.

It's true that the cmov instruction uses the CPU flags register (which I'm sure the CPU designers at Intel hate), but that doesn't mean that it branches.

You can achieve the same effect as cmov by using bitwise operations and a subtraction. It'd be a few cycles slower, but it would make it even clearer that no branch is involved.
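A minimal sketch of what that looks like (the function name and exact formulation are mine, not from the comment): negating a 0/1 condition yields an all-zeros or all-ones mask, and the xor/and dance selects between the two values using only ALU operations.

```cpp
#include <cstdint>

// Branchless "conditional move": returns a if cond is true, else b.
// -(int64_t)cond is all ones (0xFF...F) when cond == 1 and zero when
// cond == 0, so the mask selects a value with pure ALU ops -- no branch.
int64_t select_branchless(bool cond, int64_t a, int64_t b) {
    int64_t mask = -static_cast<int64_t>(cond);
    return b ^ ((a ^ b) & mask);
}
```

When cond is true the mask is all ones, so the expression reduces to `b ^ (a ^ b) == a`; when false the mask is zero and it reduces to `b`.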

replies(4): >>35738178 #>>35738206 #>>35738285 #>>35742166 #
kevingadd ◴[] No.35738206[source]
The main problem with branches in an algorithm like this is that they create a dependency: future instructions not only depend on the possibly-mispredicted outcome of the previous instruction and may need to be thrown out, they may be the wrong instructions to have started executing in the first place.

A cmov doesn't affect the instruction pointer, so that 'executing the wrong instructions' condition is gone, but it still affects the output of all following instructions if they depend on the register it modifies. So you still have to worry about pipelined work being thrown away on a mispredict.

IMO, cmov was far more valuable back when it was new, because branches were more expensive then. IIRC it was added to x86 with the Pentium Pro, around the MMX era, so a very long time ago for very different processor architectures. It's still a useful instruction category to have, if only because it lets you produce smaller code (and smaller code is usually faster), but I expect the value is less significant than it used to be.

Incidentally I am curious whether Linus's position on cmov back in 2007 is one he still holds today, and whether his assertions hold up on a modern Intel or AMD CPU. https://yarchive.net/comp/linux/cmov.html

I think the branchless search in this blog post is faster simply because the inner loop is much smaller. The observation that it sometimes does extra comparisons explains why it is slower for an expensive comparison operator, because the advantage of the smaller loop is gone. I don't think it being "branchless" is terribly important in comparison. You could try to fudge an experiment to prove this by inserting extra instructions, but they would need to be ones that decode into actual executed uops, and you'd need to decide whether they should have dependencies or not.

In the bad old days when I got promoted to the programming team at a job they had me take a test, and one of the tasks was to write a transparent blit routine. The one I wrote was "branchless", but because cmov intrinsics weren't widely available I achieved this by using an array of pointers and selecting the array index based on the result of a comparison, something like:

  unsigned char* src_dest[] = { src, dest };
  for (...; ...; src_dest[0]++, src_dest[1]++)
    *src_dest[1] = *src_dest[*src_dest[0] == transparent_value];
The lead who reviewed the test had to come by and ask me to explain how it worked. Certainly, this was "branchless", even without the presence of cmov, since it was just doing movs based on a 0/1 produced by the comparison. But the actual source of the dependency here is the comparison, not a jmp or a cmov. The loop was faster than a branch on the CPUs of that era, but these days it would probably be much slower.
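For what it's worth, a fleshed-out version of that pointer-array trick might look like this (names and signature are illustrative, not the original test code): the comparison yields 0 or 1, which indexes either src or dest, so a "transparent" source byte ends up copying dest over itself.

```cpp
#include <cstddef>

// Sketch of a branchless transparent blit using an array of pointers.
// For each byte: if src[i] == transparent_value, the index is 1 and we
// copy dest[i] onto itself (a no-op); otherwise we copy src[i].
void transparent_blit(const unsigned char* src, unsigned char* dest,
                      std::size_t n, unsigned char transparent_value) {
    const unsigned char* src_dest[] = { src, dest };
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = src_dest[src[i] == transparent_value][i];
}
```

The selection happens through the array index, so there is no jump; the data dependency on the comparison result remains, as the comment notes.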
replies(2): >>35738262 #>>35738559 #
2102922286 ◴[] No.35738262[source]
For the given assembly from the blog post

    loop:
        lea (%rdx,%rax,4),%rcx
        cmp (%rcx),%esi
        cmovg %rcx,%rdx
        shr %rax
        jne loop
Here's a simulated CPU trace on Intel skylake: https://uica.uops.info/tmp/2de9d862d05d482ebed576d7e3923b93_...

Note that this tracer makes the assumption that all memory loads are in cache (otherwise the memory lookup will dominate). So bear that in mind for this code, especially since memory reads will likely dominate the cost of a binary search.

Regardless, it appears that the cost of the conditional move is not the source of the bottleneck.
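As a rough C++ rendering of that assembly loop (a sketch; the blog's actual source may differ): `step` plays the role of %rax, `base` of %rdx, and the `if` compiles down to the cmp/cmovg pair.

```cpp
#include <cstddef>

// Rendering of the cmov loop above. Starting from a suitable base and
// power-of-two step over a sorted array, it halves the step each
// iteration and advances base when the key is greater than the probed
// element, ending at the last element less than the key.
const int* branchless_step_search(const int* base, std::size_t step, int key) {
    while (step != 0) {
        const int* candidate = base + step;    // lea (%rdx,%rax,4),%rcx
        if (key > *candidate) base = candidate; // cmp + cmovg
        step >>= 1;                             // shr %rax
    }
    return base;
}
```

Written this way, each iteration's only control-flow branch is the loop condition itself, which is highly predictable; the data-dependent choice goes through the conditional move.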

replies(2): >>35738304 #>>35738355 #
1. jeffreygoesto ◴[] No.35738304[source]
How do you get such an image, please?
replies(1): >>35738352 #
2. 2102922286 ◴[] No.35738352[source]
https://uops.info/uiCA.html. It's really cool!