Yeah. When your timing results are a single-digit multiple of your timing precision, that is a good indication you need either a longer test or a more precise clock.
At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with clock ticks.
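To make the arithmetic concrete, here is a rough sketch of how I would time this kind of thing. It is not the code from the original test. It assumes a POSIX system with clock_gettime, and the names now_ns, quantization_error, time_per_iter_ns and sample_work are all made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std=c11 */
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds (assumes POSIX clock_gettime). */
static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Relative size of one clock tick compared to the measured interval.
 * A 1 ms tick on a 5 ms run gives 0.20, i.e. a 20% granularity. */
static double quantization_error(int64_t elapsed_ns, int64_t tick_ns) {
    return (double)tick_ns / (double)elapsed_ns;
}

/* Hypothetical helper: double the iteration count until one timed run
 * lasts at least min_ns, then report the average nanoseconds per call.
 * fn must do observable work (e.g. write to a volatile), otherwise the
 * compiler may delete the loop. The iteration cap is only a safety net. */
static double time_per_iter_ns(void (*fn)(void), int64_t min_ns) {
    int64_t iters = 1;
    for (;;) {
        int64_t start = now_ns();
        for (int64_t i = 0; i < iters; i++) {
            fn();
        }
        int64_t elapsed = now_ns() - start;
        if (elapsed >= min_ns || iters >= (1LL << 40)) {
            return (double)elapsed / (double)iters;
        }
        iters *= 2;
    }
}

/* Stand-in workload for illustration only; the volatile store keeps
 * the call from being optimized away. */
static volatile int sink;
static void sample_work(void) {
    sink = sink + 1;
}
```

Calling time_per_iter_ns(sample_work, 100000000) keeps doubling the iteration count until a single run lasts at least 100 ms, at which point a 1 ms tick is about 1% of the measurement instead of 20%.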
For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:
== Benchmarking ABS ==
ABS (branch): 0.260 sec
ABS (branchless): 0.264 sec
== Benchmarking CLAMP ==
CLAMP (branch): 0.332 sec
CLAMP (branchless): 0.538 sec
== Benchmarking PARTITION ==
PARTITION (branch): 0.043 sec
PARTITION (branchless): 0.091 sec
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native; I did not use the -fomit-this-entire-function flag, which my compiler does not understand).
I had to drop down to -O0 to see branchless come out faster in any of the cases:
== Benchmarking ABS ==
ABS (branch): 0.743 sec
ABS (branchless): 0.948 sec
== Benchmarking CLAMP ==
CLAMP (branch): 4.275 sec
CLAMP (branchless): 1.429 sec
== Benchmarking PARTITION ==
PARTITION (branch): 0.156 sec
PARTITION (branchless): 0.164 sec