←back to thread

170 points judicious | 1 comments | | HN request time: 0s | source
Show context
sfilmeyer ◴[] No.45406379[source]
I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, but then the summary is claiming that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", it doesn't leave me confident that small changes to the benchmarking might have substantially changed that conclusion.
replies(2): >>45406733 #>>45407659 #
gizmo686 ◴[] No.45407659[source]
Yeah. When your timing results are a single digit multiple of your timing precision, that is a good indication you either need a longer test, or a more precise clock.

At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup with a 20% slowdown that happened to get luck with clock ticks.

For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:

  == Benchmarking ABS ==
  ABS (branch):     0.260 sec
  ABS (branchless): 0.264 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     0.332 sec 
  CLAMP (branchless): 0.538 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.043 sec
  PARTITION (branchless): 0.091 sec
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native. I did not use the -fomit-this-entire-function flag, which my compiler does not understand).

I had to drop down to O0 to see branchless be faster in any case:

  == Benchmarking ABS ==
  ABS (branch):     0.743 sec
  ABS (branchless): 0.948 sec

  == Benchmarking CLAMP ==
  CLAMP (branch):     4.275 sec
  CLAMP (branchless): 1.429 sec

  == Benchmarking PARTITION ==
  PARTITION (branch):     0.156 sec
  PARTITION (branchless): 0.164 sec
replies(2): >>45408287 #>>45418236 #
1. Someone ◴[] No.45418236[source]
> I had to drop down to O0 to see branchless be faster in any case

Did you check whether your branchy code actually still was branchy after the compiler processed it at higher optimization levels?