
170 points judicious | 3 comments
sfilmeyer ◴[] No.45406379[source]
I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, yet the summary claims that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", it doesn't leave me confident that small changes to the benchmarking setup wouldn't have substantially changed that conclusion.
replies(2): >>45406733 #>>45407659 #
1. Joel_Mckay ◴[] No.45406733[source]
In general, modern compilers will often unroll loops or inline functions without people even noticing. This often helps with cache locality and instruction-level parallelism.

Most code should focus on readability first, then be profiled for hot spots under realistic use, and finally have those hot spots refactored through hand optimization or register hints as required.

If one writes something that looks suspect (e.g. an inline assembly macro), a peer or a later LLVM build will come along and break it for sure. Have a great day =3

replies(1): >>45407746 #
2. hinkley ◴[] No.45407746[source]
Doesn’t it also help with branch prediction, since each copy of the body in an unrolled loop can accumulate its own predictor statistics?
replies(1): >>45408062 #
3. Joel_Mckay ◴[] No.45408062[source]
Non-overlapping sub-problems may be safely parallelized and executed out-of-order.

On some architectures, both sides of a branch are executed speculatively in parallel, and the untaken side's results are simply discarded once the dependent operations resolve. We can't be sure exactly how branch predictors and prefetchers are implemented, as the details fall under manufacturer NDA. =3