Beautiful branchless binary search

“Those spikes for std::lower_bound are on powers of two, where it is somehow much slower. I looked into it a little bit but can’t come up with an easy explanation. The Clang version has the same spikes even though it compiles to very different assembly.”

I saw this and immediately went “oh, those look like Intel hardware”.

Intel hardware uses only the low 12 bits of an address for its quick conflict check between in-flight memory operations, resulting in an issue known as “4K Aliasing”. When two addresses are the same modulo 4K, the hardware sees a collision and has to complete the earlier memory operation before that address can be used again in the load/store port system. This effectively serializes the operations and makes performance very dependent on the data stride.
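To make the “same modulo 4K” condition concrete, here is a minimal sketch of the address arithmetic. It is only a simplified model, not what std::lower_bound actually does: the helper names (same_low_12_bits, aliasing_probe_pairs) are made up for illustration, and the probe sequence assumes a plain halving search that always goes left.

```cpp
#include <cstddef>
#include <cstdint>

// True when two byte addresses agree in their low 12 bits (equal modulo 4096).
constexpr bool same_low_12_bits(std::uintptr_t a, std::uintptr_t b) {
    return ((a ^ b) & 0xFFFu) == 0;
}

// Hypothetical helper: models a halving search over `count` elements of
// `elem_size` bytes that always goes left, and counts how many consecutive
// probes land on byte offsets that are equal modulo 4096.
constexpr std::size_t aliasing_probe_pairs(std::size_t count, std::size_t elem_size) {
    std::size_t pairs = 0;
    std::size_t prev = (count / 2) * elem_size;  // byte offset of the first probe
    for (std::size_t half = count / 4; half > 0; half /= 2) {
        std::size_t cur = half * elem_size;      // byte offset of the next probe
        if (same_low_12_bits(prev, cur)) ++pairs;
        prev = cur;
    }
    return pairs;
}
```

Under this model, an array of 2^20 four-byte elements gives 9 consecutive probe pairs with identical low 12 bits, while an array of 1,000,000 elements gives none, which is at least consistent with the spikes sitting on powers of two.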

I first bumped up against this when running vertical passes of image processing algorithms, which got very slow at certain image sizes. The problem could be avoided by using an oversized buffer with a correspondingly oversized per-line “pitch”, which diagonally offsets the aliased addresses (at a small cost to inter-line cache line overlap).
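The pitch trick can be sketched roughly as follows. This is an illustration under stated assumptions, not the code I used back then: padded_pitch is a hypothetical helper, the 64-byte cache line is assumed, and it only handles the simplest case where the row size is an exact multiple of 4096 bytes.

```cpp
#include <cstddef>

constexpr std::size_t kAliasBytes = 4096;  // 4K aliasing granularity
constexpr std::size_t kCacheLine = 64;     // assumed cache line size

// Hypothetical helper: round the row size up to a whole cache line, and if
// the result is a multiple of 4096 bytes, add one more cache line so that
// vertically adjacent pixels no longer share their low 12 address bits.
constexpr std::size_t padded_pitch(std::size_t row_bytes) {
    std::size_t pitch = (row_bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
    if (pitch % kAliasBytes == 0) pitch += kCacheLine;
    return pitch;
}

// Byte offset of pixel (x, y) in a buffer allocated as pitch * height bytes,
// where bytes_per_pixel is the size of one pixel.
constexpr std::size_t pixel_offset(std::size_t x, std::size_t y,
                                   std::size_t pitch, std::size_t bytes_per_pixel) {
    return y * pitch + x * bytes_per_pixel;
}
```

With a 1024-pixel-wide image at 4 bytes per pixel, each row is exactly 4096 bytes, so every pixel in a column has the same low 12 address bits. Padding the pitch to 4160 bytes shifts each successive row by one cache line, so a vertical pass no longer walks a column of aliased addresses.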