386 points ingve | 15 comments
1. kookamamie ◴[] No.35738041[source]
Trusting the C++ compiler to emit a specific instruction for guaranteed performance - cute, but not realistic.
replies(1): >>35738060 #
2. vkazanov ◴[] No.35738060[source]
I don't think it is the instruction that makes the difference. Cmov is a branch either way.

Branches can be unpredictable and slow, no matter what instructions (or tables, or whatever) are used to introduce the branch in question.

It is the predictable nature of branching in this algo that makes the difference. Which makes me wonder...

EDIT: the cmov vs explicit branching story is a bit more complicated than just branch vs no branch.

replies(1): >>35738115 #
3. 2102922286 ◴[] No.35738115[source]
Cmov doesn't branch. A branch refers specifically to the program counter ending up in more than one possible place after an instruction has executed. It is this behavior that mucks with the CPU state and slows everything down.

It's true that the cmov instruction uses the CPU flags register (which I'm sure the CPU designers at Intel hate), but that doesn't mean that it branches.

You can achieve the same effect as cmov by using bitwise operations and a subtraction. It'd be a few cycles slower, but it would make it even clearer that it doesn't branch.

replies(4): >>35738178 #>>35738206 #>>35738285 #>>35742166 #
4. vkazanov ◴[] No.35738178{3}[source]
Sure, it doesn't change the PC. But it can introduce a branch indirectly, i.e. when there is a jump to an address MOVed by cmov.

Either way, it seems that the wisdom has changed since the last time I wrote and read assembly. Cmov used to be slow. It seems that the current answer is "it depends".

replies(1): >>35738226 #
5. kevingadd ◴[] No.35738206{3}[source]
The main problem with branches in an algorithm like this is that they create a dependency: not only do future instructions depend on the possibly-mispredicted outcome of the branch and need to get thrown out on a mispredict, but you may have started executing the wrong instructions entirely.

A cmov doesn't affect the instruction pointer, so that 'executing the wrong instructions' condition is gone, but it still affects the output of all following instructions if they depend on the register it modifies. So you still have to worry about pipelined work being thrown away on a mispredict.

IMO, cmov was far more valuable back when it was new, because branches were more expensive then. It was added to x86 with the Pentium Pro (P6) in the mid-90s, so a very long time ago for very different processor architectures. It's still a useful instruction category to have, if only because it lets you produce smaller code (and smaller code is usually faster), but I expect the value is less significant than it used to be.

Incidentally I am curious whether Linus's position on cmov back in 2007 is one he still holds today, and whether his assertions hold up on a modern Intel or AMD CPU. https://yarchive.net/comp/linux/cmov.html

I think the branchless search in this blog post is faster simply because the inner loop is much smaller. The observation that it sometimes does extra comparisons explains why it is slower for an expensive comparison operator, because the advantage of the smaller loop is gone. I don't think it being "branchless" is terribly important in comparison. You could try to fudge an experiment to prove this by inserting extra instructions, but they would need to be ones that decode into actual executed uops, and you'd need to decide whether they should have dependencies or not.

In the bad old days when I got promoted to the programming team at a job they had me take a test, and one of the tasks was to write a transparent blit routine. The one I wrote was "branchless", but because cmov intrinsics weren't widely available I achieved this by using an array of pointers and selecting the array index based on the result of a comparison, something like:

  unsigned char* src_dest[] = { src, dest };
  for (...; ...; src_dest[0]++, src_dest[1]++)
    *src_dest[1] = *src_dest[*src_dest[0] == transparent_value];
The lead who reviewed the test had to come by and ask me to explain how it worked. Certainly, this was "branchless", even without the presence of cmov, since it was just doing movs based on a 0/1 produced by the comparison. But the actual source of the dependency here is the comparison, not a jmp or a cmov. The loop was faster than a branch on the CPUs of that era, but these days it would probably be much slower.
replies(2): >>35738262 #>>35738559 #
6. 2102922286 ◴[] No.35738226{4}[source]
If what you're saying is (roughly)

        cmovne  rax, rdx
        jmp     rax
that is, a cmov followed by an indirect jump to the address contained in rax, then note that "jmp rax" is _always_ an indirect jump. It doesn't matter whether rax was set via a conditional move instruction or not.
7. 2102922286 ◴[] No.35738262{4}[source]
For the given assembly from the blog post

    loop:
        lea (%rdx,%rax,4),%rcx
        cmp (%rcx),%esi
        cmovg %rcx,%rdx
        shr %rax
        jne loop
Here's a simulated CPU trace on Intel skylake: https://uica.uops.info/tmp/2de9d862d05d482ebed576d7e3923b93_...

Note that this tracer makes the assumption that all memory loads are in cache (otherwise the memory lookup will dominate). So bear that in mind for this code, especially since memory reads will likely dominate the cost of a binary search.

Regardless, it appears that the cost of the conditional move is not the source of the bottleneck.

replies(2): >>35738304 #>>35738355 #
8. Dwedit ◴[] No.35738285{3}[source]
For the case of comparing integers, and you want to know if A - B >= 0, you can take A - B, Arithmetic Shift Right 31 bits, then you have 0 if Greater Than or Equal and -1 if Less Than. From there you can just AND that with your displacement.

Note that it's not a comparison on A and B, it's a comparison on A - B, which must not overflow.

9. jeffreygoesto ◴[] No.35738304{5}[source]
How do you get such an image please?
replies(1): >>35738352 #
10. 2102922286 ◴[] No.35738352{6}[source]
https://uops.info/uiCA.html it's really cool!!
11. kevingadd ◴[] No.35738355{5}[source]
Wow, the dispatch queue is so deep! The trace makes a lot of sense, thanks for sharing it. I must have been unclear if it sounded like I was saying cmov was a bottleneck - my argument is just that it's not going to be especially faster than a branch in the general case.

Great point that memory reads are going to dominate a binary search - at that point the CPU will be spending a lot more time waiting for memory than it is spending executing your branches or cmovs.

12. Sesse__ ◴[] No.35738559{4}[source]
> IMO, cmov was far more valuable back when it was new, because branches were more expensive then.

Branches in general are cheaper now (since they are generally predicted better), but mispredicted branches are more expensive. They are similar in terms of cycles (IIRC Pentium 4 and Skylake are both typically around 20 cycles if you don't get second-order effects tacked on), but you generally get a lot more done in those cycles on a newer CPU, so they are much more important than they used to be. And a binary search will, almost by definition, tend to have 50% mispredicted branches on the compare (i.e., entirely unpredictable, so no better than random chance).

13. Paul-Craft ◴[] No.35742166{3}[source]
Thank you for this comment! I was sitting here, metaphorically scratching my head, trying to figure out how the hell you can call this branchless when there's clearly an `if` in there:

        for (step /= 2; step != 0; step /= 2) {
            if (compare(begin[step], value))
                begin += step;
        }
Your comment, especially this part, made it click for me:

> You can achieve the same effect as cmov by using bitwise operations and a subtraction, though it'd just be a few cycles slower--but it would be even more clear that it doesn't branch.

BTW, I hate how the author doesn't use braces with this `if` statement. Many a production bug, including one in OS X as I recall, has occurred because someone didn't want to type a couple of extra characters.

replies(1): >>35744198 #
14. secondcoming ◴[] No.35744198{4}[source]
Most topics I've read/viewed on branchless stuff seem to boil down to doing arithmetic with bools. So your code above can be thought of as:

   bool b = compare(begin[step], value);
   begin += (step * b);
replies(1): >>35744463 #
15. Paul-Craft ◴[] No.35744463{5}[source]
Yeah, I got that. That was the part u/2102922286's comment made click for me. I guess I just don't do enough low level stuff for that to be something that automatically occurs to me.