I saw this and immediately went “oh, those look like Intel hardware”.
Intel's load/store hardware does a quick first-pass address comparison using only the low 12 bits, which leads to an issue known as “4K aliasing”. When two in-flight memory operations have addresses that are equal modulo 4 KiB (typically a load issued shortly after a store), the hardware treats them as a possible conflict and holds the later one until the earlier one has completed. That effectively serializes the operations and makes performance very dependent on the data stride.
I first bumped up against this when running vertical passes of image processing algorithms, which got very slow at certain image sizes. The problem could be avoided by allocating an oversized buffer with a correspondingly oversized per-line “pitch”, so that aliased addresses are offset diagonally (at a small cost to inter-line cache line overlap).
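To make the pitch trick concrete, here is a minimal sketch in C. The helper names `padded_pitch` and `rows_until_alias` are made up for illustration, and the 64-byte cache line is an assumption about the target CPU. It rounds the row size up to an odd number of cache lines, which is one possible way to pick the padding and not necessarily the one I used back then.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: 64-byte cache lines, 4 KiB aliasing window. */
enum { CACHE_LINE = 64, ALIAS_SPAN = 4096 };

/* Hypothetical helper: smallest pitch >= row_bytes that is an ODD
 * multiple of CACHE_LINE. With an odd line count, k * pitch is a
 * multiple of 4096 only when k is a multiple of 64. */
static size_t padded_pitch(size_t row_bytes)
{
    assert(row_bytes > 0);
    size_t lines = (row_bytes + CACHE_LINE - 1) / CACHE_LINE; /* round up */
    if (lines % 2 == 0)
        lines += 1;               /* pad by one cache line to make it odd */
    return lines * CACHE_LINE;
}

/* Hypothetical helper: how many rows apart two pixels in the same column
 * must be before their addresses are equal modulo 4096 again. */
static size_t rows_until_alias(size_t pitch)
{
    assert(pitch > 0);
    size_t k = 1;
    while ((k * pitch) % ALIAS_SPAN != 0)
        ++k;                      /* terminates by k == ALIAS_SPAN at the latest */
    return k;
}
```

For a row of 4096 bytes (say 1024 RGBA pixels), `rows_until_alias(4096)` is 1, so every row in a column aliases with its neighbours. `padded_pitch(4096)` gives 4160, and `rows_until_alias(4160)` is 64, so a vertical pass only sees an aliased address every 64 rows. You would then allocate `pitch * height` bytes and index pixels as `base + y * pitch + x * bytes_per_pixel`.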