386 points ingve | 3 comments
1. chaboud No.35741301
“Those spikes for std::lower_bound are on powers of two, where it is somehow much slower. I looked into it a little bit but can’t come up with an easy explanation. The Clang version has the same spikes even though it compiles to very different assembly.”

I saw this and immediately went “oh, those look like Intel hardware”.

Intel's load/store port hardware does its quick address matching on only the low 12 bits, resulting in an issue known as “4K Aliasing”. When two addresses are the same modulo 4K, they collide, and the collision has to be resolved by completing the prior memory operation to free up the address in the load/store port system. That effectively serializes the operations and makes performance very dependent on the data stride.
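
To make the "same modulo 4K" condition concrete, here is a minimal sketch of the check. The helper name `may_4k_alias` and the constant `kAliasMask` are hypothetical names for illustration, and the sketch assumes only the low 12 address bits matter, as described above; real hardware also takes the access sizes into account.

```cpp
#include <cstdint>

// Assumption: only the low 12 address bits (a 4 KiB window) are compared.
constexpr std::uintptr_t kAliasMask = 0xFFF;

// Hypothetical helper: true when two addresses are equal modulo 4096,
// i.e. when they could collide under the 12-bit comparison.
constexpr bool may_4k_alias(std::uintptr_t a, std::uintptr_t b) {
    // XOR leaves a 1 wherever the addresses differ; keep only the low 12 bits.
    return ((a ^ b) & kAliasMask) == 0;
}
```

Two buffers whose base addresses differ by an exact multiple of 4096 bytes would report `true` here for every pair of matching offsets, which is the stride pattern to look out for.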

I first bumped up against this when running vertical passes of image processing algorithms that got very slow at certain image sizes. The problem could be avoided by using an oversized buffer and a correspondingly oversized per-line “pitch” to diagonally offset aliased addresses (at a small cost to inter-line cache line overlap).
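
A rough sketch of the oversized-pitch idea, not the code I actually used: `padded_pitch` is a hypothetical helper, and the 64-byte cache line and the "odd number of cache lines" rule are assumptions chosen for illustration. With an odd line count, two rows only land on the same address modulo 4096 when they are a multiple of 64 rows apart.

```cpp
#include <cstddef>

// Assumptions: 64-byte cache lines and a 4096-byte aliasing window.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kAliasWindow = 4096;

// Hypothetical helper: choose a per-row pitch (in bytes) for an image whose
// rows hold row_bytes of pixel data.
constexpr std::size_t padded_pitch(std::size_t row_bytes) {
    // Round the row up to a whole number of cache lines.
    std::size_t lines = (row_bytes + kCacheLine - 1) / kCacheLine;
    // An even line count shares a factor of two with 4096/64, so nearby rows
    // would repeat the same low 12 bits; bump it to the next odd count.
    if (lines % 2 == 0) {
        lines += 1;
    }
    return lines * kCacheLine;
}

// Byte offset of pixel column x_bytes in row y of the padded buffer.
constexpr std::size_t pixel_offset(std::size_t y, std::size_t x_bytes,
                                   std::size_t pitch) {
    return y * pitch + x_bytes;
}
```

The buffer is then allocated as `padded_pitch(row_bytes) * height` bytes, and vertical passes step by the pitch instead of by the raw row width.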

replies(2): >>35746401 #>>35753879 #
2. simplotek No.35746401
What a superb post. Thank you for this gem.
3. touisteur No.35753879
For those interested in a deeper dive, Richard Startin has a nice post on the topic: https://richardstartin.github.io/posts/4k-aliasing