Here's my favorite practically applicable cache-related fact: even on recent x86 server CPUs, the cache-coherency protocol may operate at a different granularity than the cache-line size. A typical case on new Intel server CPUs is a granularity of 2 consecutive cache lines. Some thread-pool implementations, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document this and align objects to 128 bytes [1]:
/**
* @brief Defines variable alignment to avoid false sharing.
* @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
* @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
*
* The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
*
* @code{.cpp}
* #if defined(__cpp_lib_hardware_interference_size)
* static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
* #else
* static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
* #endif
* @endcode
*
 * That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment
 * choice, unless you hard-code `--param hardware_destructive_interference_size=64` or
 * disable the warning with `-Wno-interference-size`.
*/
static constexpr std::size_t default_alignment_k = 128;
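A minimal sketch of how such a constant is typically applied (the `padded_counter_t` name is mine, not ForkUnion's): pad each hot, independently updated object to 128 bytes, so two of them never land in the same coherence unit even when the protocol operates on pairs of 64-byte lines.

```cpp
#include <atomic>  // std::atomic
#include <cstddef> // std::size_t

static constexpr std::size_t default_alignment_k = 128;

// One counter per worker thread; `alignas` keeps neighbors apart.
struct alignas(default_alignment_k) padded_counter_t {
    std::atomic<std::size_t> value {0};
};

// `alignas` also rounds the struct size up, so adjacent array elements
// sit exactly one coherence unit apart.
static_assert(sizeof(padded_counter_t) == default_alignment_k);
static_assert(alignof(padded_counter_t) == default_alignment_k);
```

An array of `padded_counter_t`, one element per thread, then gives every thread a private coherence unit instead of a falsely shared one.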
As mentioned in the docstring above, relying on the STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue is even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence granularities and use "dynamic dispatch", similar to how I and others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
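A hypothetical sketch of that dispatch idea (the function names and the probe are mine, not from any of the libraries mentioned): build one kernel per assumed coherence granularity and pick at startup, the way SIMD libraries pick on CPUID. The interference span itself is not directly queryable, so a real implementation would also consult CPU model tables; glibc's `sysconf` line-size query below is only a stand-in and under-reports the span on the Intel parts discussed above.

```cpp
#include <unistd.h> // sysconf; _SC_LEVEL1_DCACHE_LINESIZE is glibc-specific

using kernel_t = void (*)();

void kernel_for_64_byte_lines()  { /* data structures padded to 64 bytes  */ }
void kernel_for_128_byte_units() { /* data structures padded to 128 bytes */ }

kernel_t pick_kernel() {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
    long const line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#else
    long const line = 0; // query unavailable on this libc
#endif
    // The query reports a single line; a production version would widen
    // this to 2 lines for CPU families known to coordinate coherence at
    // that granularity, instead of trusting the reported size alone.
    return line >= 128 ? kernel_for_128_byte_units : kernel_for_64_byte_lines;
}
```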