Here's my favorite practically applicable cache-related fact: even on recent x86 server CPUs, the cache-coherency protocol may operate at a different granularity than the cache-line size. A typical case on new Intel server CPUs is a granularity of 2 consecutive cache lines. Some thread-pool implementations, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document this and align objects to 128 bytes [1]:
/**
* @brief Defines variable alignment to avoid false sharing.
* @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
* @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
*
* The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
*
* @code{.cpp}
* #if defined(__cpp_lib_hardware_interference_size)
* static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
* #else
* static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
* #endif
* @endcode
*
 * That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment
 * choice, unless you hard-code `--param hardware_destructive_interference_size=64` or
 * disable the warning with `-Wno-interference-size`.
*/
static constexpr std::size_t default_alignment_k = 128;
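A minimal sketch of how such a constant is typically applied (the `padded_counter_t` name is mine, not ForkUnion's): pad each hot, independently updated object to 128 bytes, so two of them never land in the same coherence unit even when the protocol operates on pairs of 64-byte lines.

```cpp
#include <atomic>  // std::atomic
#include <cstddef> // std::size_t

static constexpr std::size_t default_alignment_k = 128;

// One counter per worker thread; `alignas` keeps neighbors apart.
struct alignas(default_alignment_k) padded_counter_t {
    std::atomic<std::size_t> value {0};
};

// `alignas` also rounds the struct size up, so adjacent array elements
// sit exactly one coherence unit apart.
static_assert(sizeof(padded_counter_t) == default_alignment_k);
static_assert(alignof(padded_counter_t) == default_alignment_k);
```

An array of `padded_counter_t`, one element per thread, then gives every thread a private coherence unit instead of a falsely shared one.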
As mentioned in the docstring above, relying on the STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue is even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence granularities and use "dynamic dispatch", similar to how I and others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
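A hypothetical sketch of that dispatch idea (the function names and the probe are mine, not from any of the libraries mentioned): build one kernel per assumed coherence granularity and pick at startup, the way SIMD libraries pick on CPUID. The interference span itself is not directly queryable, so a real implementation would also consult CPU model tables; glibc's `sysconf` line-size query below is only a stand-in and under-reports the span on the Intel parts discussed above.

```cpp
#include <unistd.h> // sysconf; _SC_LEVEL1_DCACHE_LINESIZE is glibc-specific

using kernel_t = void (*)();

void kernel_for_64_byte_lines()  { /* data structures padded to 64 bytes  */ }
void kernel_for_128_byte_units() { /* data structures padded to 128 bytes */ }

kernel_t pick_kernel() {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
    long const line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#else
    long const line = 0; // query unavailable on this libc
#endif
    // The query reports a single line; a production version would widen
    // this to 2 lines for CPU families known to coordinate coherence at
    // that granularity, instead of trusting the reported size alone.
    return line >= 128 ? kernel_for_128_byte_units : kernel_for_64_byte_lines;
}
```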