The Weird Concept of Branchless Programming

(sanixdk.xyz)

170 points judicious | 1 comments | 28 Sep 25 16:40 UTC | HN request time: 0.201s | source

Show context

sfilmeyer ◴[28 Sep 25 17:54 UTC] No.45406379[source]▶

I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, but then the summary is claiming that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", it doesn't leave me confident that small changes to the benchmarking might have substantially changed that conclusion.

replies(2): >>45406733 #>>45407659 #

gizmo686 ◴[28 Sep 25 20:25 UTC] No.45407659[source]▶

>>45406379 #

Yeah. When your timing results are a single digit multiple of your timing precision, that is a good indication you either need a longer test, or a more precise clock.

At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup with a 20% slowdown that happened to get luck with clock ticks.

For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:

  == Benchmarking ABS ==
  ABS (branch):     0.260 sec
  ABS (branchless): 0.264 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     0.332 sec 
  CLAMP (branchless): 0.538 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.043 sec
  PARTITION (branchless): 0.091 sec

Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native. I did not use the -fomit-this-entire-function flag, which my compiler does not understand).

I had to drop down to O0 to see branchless be faster in any case:

  == Benchmarking ABS ==
  ABS (branch):     0.743 sec
  ABS (branchless): 0.948 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     4.275 sec
  CLAMP (branchless): 1.429 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.156 sec
  PARTITION (branchless): 0.164 sec

replies(2): >>45408287 #>>45418236 #

1. Roxxik ◴[28 Sep 25 21:36 UTC] No.45408287[source]▶

>>45407659 #

I also tried myself, on different array sizes, with more iterations. The branchy version is not strictly worse.

https://gist.github.com/Stefan-JLU/3925c6a73836ce841860b55c8...

↑