
170 points judicious | 3 comments
sfilmeyer ◴[] No.45406379[source]
I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, yet the summary claims that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", it doesn't leave me confident that small changes to the benchmarking setup wouldn't have substantially changed that conclusion.
replies(2): >>45406733 #>>45407659 #
1. Joel_Mckay ◴[] No.45406733[source]
In general, modern compilers will often unroll loops or inline functions without people even noticing. This often helps with cache locality and instruction-level parallelism.

Most code should focus on readability first, then be profiled for hot spots under realistic use, and finally have those hot spots refactored through hand optimization or register hints as required.

If one writes something that looks suspect (e.g. an inline assembly macro), a peer or a later LLVM build will come along and break it for sure. Have a great day =3

replies(1): >>45407746 #
2. hinkley ◴[] No.45407746[source]
Doesn’t it also help with branch prediction, since each copy of the body in an unrolled loop can accumulate its own predictor statistics?
replies(1): >>45408062 #
3. Joel_Mckay ◴[] No.45408062[source]
Non-overlapping sub-problems may be safely parallelized and executed out-of-order.

On some architectures, both sides of a branch are executed speculatively in parallel, and the untaken side's results are simply discarded once the dependent operations resolve. We can't be sure exactly how branch predictors and prefetchers are implemented, as the details fall under manufacturer NDA. =3