But do you know what's not free? Memory accesses[1]. So when I'm optimizing things, I focus on making things more cache friendly.
[1] http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...
But do you know what's not free? Memory accesses[1]. So when I'm optimizing things, I focus on making things more cache friendly.
[1] http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...
https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf (though see https://blog.regehr.org/archives/1515: "This piece (...) explains why Daniel J. Bernstein’s talk, The death of optimizing compilers (audio [http://cr.yp.to/talks/2015.04.16/audio.ogg]) is wrong", citing https://news.ycombinator.com/item?id=9397169)
https://blog.royalsloth.eu/posts/the-compiler-will-optimize-...
http://lua-users.org/lists/lua-l/2011-02/msg00742.html
https://web.archive.org/web/20150213004932/http://x264dev.mu...
Daniel Berlin's https://news.ycombinator.com/item?id=9397169 does kind of disagree, saying, "If GCC didn't beat an expert at optimizing interpreter loops, it was because they didn't file a bug and give us code to optimize," but his actual example is the CPython interpreter loop, which is light-years from the kind of hand-optimized assembly interpreter Mike Pall's post is talking about, and moreover it wasn't feeding an interpreter loop to GCC but rather replacing interpretation with run-time compilation. Mostly what he disagrees about is the same thing Regehr disagrees about: whether there's enough code in the category of "not worth hand-optimizing but still runs often enough to matter", not whether you can beat a compiler by hand-optimizing your code. On the contrary, he brings up whole categories of code where compilers can't hope to compete with hand-optimization, such as numerical algorithms where optimization requires sacrificing numerical stability. mpweiher's comment in response discusses other scenarios where compilers can't hope to compete, like systems-level optimization.
It's worth reading the comments by haberman and Mike Pall in the HN thread there where they correct Berlin about LuaJIT, and kjksf also points out a number of widely-used libraries that got 2–4× speedups over optimized C by hand-optimizing the assembly: libjpeg-turbo, Skia, and ffmpeg. It'd be interesting to see if the intervening 10 years have changed the situation, because GCC and LLVM have improved in that time, but I doubt they've improved by even 50%, much less 300%.