Wouldn't branchless UTF-8 encoding always write 4 bytes to RAM for every character (possibly to the same address)?
replies(2):
You could also pessimistically over-allocate assuming four bytes per character and then resize afterwards.
With the API in the linked blog post it's up to the user to decide how they want to use the output [u8;4] array.
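To make the over-allocate-then-truncate idea concrete, here is a rough sketch. `encode_utf8_padded` and `encode_all` are hypothetical names made up for this example; they are not the blog post's actual functions, only something with the same shape (a `[u8; 4]` plus a length). The encoder has no data-dependent `if`/`match` at the source level, but whether the compiled code ends up branch-free depends on the compiler (array indexing may still carry bounds checks).

```rust
// Per-length lookup tables, indexed as [len - 1][byte index].
// Padding positions have mask 0 and OR 0, so they come out as zero bytes.
const SHIFT: [[u32; 4]; 4] = [[0, 0, 0, 0], [6, 0, 0, 0], [12, 6, 0, 0], [18, 12, 6, 0]];
const MASK: [[u32; 4]; 4] = [
    [0x7F, 0, 0, 0],
    [0x1F, 0x3F, 0, 0],
    [0x0F, 0x3F, 0x3F, 0],
    [0x07, 0x3F, 0x3F, 0x3F],
];
const OR: [[u8; 4]; 4] = [
    [0x00, 0, 0, 0],
    [0xC0, 0x80, 0, 0],
    [0xE0, 0x80, 0x80, 0],
    [0xF0, 0x80, 0x80, 0x80],
];

/// Hypothetical helper: returns the zero-padded UTF-8 bytes of `c` and
/// the number of bytes that are actually meaningful (1..=4).
fn encode_utf8_padded(c: char) -> ([u8; 4], usize) {
    let cp = c as u32;
    // Length = 1 + number of thresholds exceeded; each comparison is cast to 0 or 1.
    let len = 1 + (cp > 0x7F) as usize + (cp > 0x7FF) as usize + (cp > 0xFFFF) as usize;
    let row = len - 1;
    let mut out = [0u8; 4];
    // Fixed trip count: all four bytes are always computed.
    for i in 0..4 {
        out[i] = ((cp >> SHIFT[row][i]) & MASK[row][i]) as u8 | OR[row][i];
    }
    (out, len)
}

/// Hypothetical caller: pessimistically allocates 4 bytes per character,
/// always stores 4 bytes, advances by the real length, then truncates.
fn encode_all(chars: &[char]) -> Vec<u8> {
    let mut out = vec![0u8; chars.len() * 4];
    let mut pos = 0;
    for &c in chars {
        let (bytes, len) = encode_utf8_padded(c);
        // Unconditional 4-byte store; the next character overwrites the padding.
        out[pos..pos + 4].copy_from_slice(&bytes);
        pos += len;
    }
    out.truncate(pos);
    out
}
```

So yes, in this usage every character costs a 4-byte store, and consecutive stores overlap whenever a character is shorter than 4 bytes. A caller that only copies `bytes[..len]` would avoid the overlap, at the cost of a variable-length copy.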
- Function A calls function B, which returns some struct S (for instance on the stack).
- B writes S by individual (small) stores.
- A wants to copy S from some place to another (e.g. store it in some other struct).
- LLVM coalesces the individual loads/stores needed to copy S into one or a series of large operations (e.g. 128-bit SSE2 loads+stores).
- These large loads are issued while the small stores from B are still pending, and necessarily overlap them.
Boom, store-to-load forwarding failure, and a bad stall. E.g., the Zen series seems to be really bad at this (only tried up to Zen 3), but there are pretty much no out-of-order CPUs that handle this without some kind of penalty.
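For reference, the source-level shape of the pattern described above looks roughly like this. The names (`S`, `Holder`, `a`, `b`) and the field layout are made up for illustration. Whether a given compiler version actually emits narrow stores followed by a wide overlapping load depends on the ABI, inlining decisions, and optimization level, so treat this as a sketch of where the problem can arise, not a guaranteed reproducer.

```rust
/// Hypothetical 32-byte struct, large enough that it is likely
/// returned through memory instead of registers (an assumption).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct S {
    f0: u32,
    f1: u32,
    f2: u32,
    f3: u32,
    f4: u32,
    f5: u32,
    f6: u32,
    f7: u32,
}

/// Hypothetical container that A copies S into.
struct Holder {
    tag: u64,
    s: S,
}

/// "B": builds S field by field. The compiler may lower this to eight
/// separate 32-bit stores into the return slot.
#[inline(never)]
fn b(x: u32) -> S {
    S {
        f0: x,
        f1: x.wrapping_add(1),
        f2: x.wrapping_add(2),
        f3: x.wrapping_add(3),
        f4: x.wrapping_add(4),
        f5: x.wrapping_add(5),
        f6: x.wrapping_add(6),
        f7: x.wrapping_add(7),
    }
}

/// "A": copies the freshly written S somewhere else. If the copy is
/// lowered to wide (e.g. 128-bit) loads while B's narrow stores are
/// still in the store buffer, each wide load spans several pending
/// stores and cannot be satisfied by forwarding from a single one.
fn a(holder: &mut Holder, x: u32) {
    holder.s = b(x);
}
```

If the compiler manages to have `b` write straight into `holder.s`, the intermediate copy disappears and so does the problem; the stall only shows up when the temporary and the wide copy both survive optimization.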