Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?
replies(2):
- Function A calls function B, which returns some struct S (for instance on the stack).
- B writes S by individual (small) stores.
- A wants to copy S from some place to another (e.g. store it in some other struct).
- LLVM coalesces the individual loads/stores needed to copy S into one large operation or a series of them (e.g. 128-bit SSE2 loads+stores).
- These large loads are issued while the small stores from B are still pending, and necessarily overlap them.
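Boom, store-to-load forwarding failure: the wide load can't be served from any single pending small store, so it has to wait until they have all committed to cache, and you get a bad stall. The Zen series seems to be really bad at this (I only tried up to Zen 3), but pretty much no out-of-order CPU handles this without some kind of penalty.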
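To make the pattern concrete, here is a minimal C sketch of the shape of code I mean. The names (`S`, `Holder`, `make_s` for B, `store_s` for A) are all made up for illustration. Whether a given compiler really emits four 4-byte stores followed by one 16-byte load depends on inlining, optimization level, and target, so treat it as a picture of the pattern rather than a guaranteed reproducer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte struct, small enough to be copied with one
 * 128-bit load and one 128-bit store. */
typedef struct {
    uint32_t a, b, c, d;
} S;

/* Hypothetical destination that embeds an S. */
typedef struct {
    uint64_t tag;
    S payload;
} Holder;

/* "B": builds S field by field. If this is not inlined, each
 * assignment can end up as its own 4-byte store. */
S make_s(uint32_t x) {
    S s;
    s.a = x;
    s.b = x + 1;
    s.c = x + 2;
    s.d = x + 3;
    return s;
}

/* "A": copies the freshly written S somewhere else. The compiler is
 * free to lower this struct copy to a single wide load + store, and
 * that load can overlap the four small stores while they are still
 * sitting in the store buffer. */
void store_s(Holder *dst, uint32_t x) {
    assert(dst != NULL);
    S tmp = make_s(x);
    dst->payload = tmp;
}
```

The result is the same either way; only the timing differs, so you would have to look at the generated assembly and a profiler to see whether the stall actually happens on your machine.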