Branchless UTF-8 Encoding

1. Dwedit ◴[17 Jan 25 21:41 UTC] No.42743566[source]▶

>>42742184 (OP) #

Wouldn't branchless UTF-8 encoding always write 4 bytes to RAM for every character (possibly to the same address)?

replies(2): >>42743614 # >>42745222 #

2. ngoldbaum ◴[17 Jan 25 21:47 UTC] No.42743614[source]▶

>>42743566 (TP) #

You could do two passes over the string: first get the total length in bytes, then fill in the output codepoint by codepoint.

You could also pessimistically over-allocate assuming four bytes per character and then resize afterwards.

With the API in the linked blog post it's up to the user to decide how they want to use the output [u8;4] array.
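A minimal sketch of the two strategies above, under stated assumptions: `encode4` is a hypothetical stand-in for the blog post's encoder (it wraps the standard library's `char::encode_utf8`, which is not branchless), and `encode_two_pass` and `encode_overallocate` are illustrative names, not anything from the post.

```rust
/// Hypothetical stand-in for the blog post's encoder: returns a fixed
/// four-byte buffer plus the number of meaningful bytes in it.
/// This wrapper uses std's `char::encode_utf8` and is NOT branchless.
fn encode4(c: char) -> ([u8; 4], usize) {
    let mut buf = [0u8; 4];
    let len = c.encode_utf8(&mut buf).len();
    (buf, len)
}

/// Strategy 1: two passes. Sum the encoded lengths first, then fill a
/// buffer of exactly that size.
fn encode_two_pass(chars: &[char]) -> Vec<u8> {
    let total: usize = chars.iter().map(|c| c.len_utf8()).sum();
    let mut out = vec![0u8; total];
    let mut pos = 0;
    for &c in chars {
        let (buf, len) = encode4(c);
        out[pos..pos + len].copy_from_slice(&buf[..len]);
        pos += len;
    }
    out
}

/// Strategy 2: pessimistically allocate four bytes per character, store
/// all four bytes every time, advance by the real length, then shrink.
fn encode_overallocate(chars: &[char]) -> Vec<u8> {
    let mut out = vec![0u8; chars.len() * 4];
    let mut pos = 0;
    for &c in chars {
        let (buf, len) = encode4(c);
        // Unconditional 4-byte store; the next character overwrites
        // whatever padding bytes follow this one.
        out[pos..pos + 4].copy_from_slice(&buf);
        pos += len;
    }
    out.truncate(pos);
    out
}
```

The second version is the one that always writes four bytes per character. It stays in bounds because the write position never exceeds four times the number of characters already processed.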

3. ack_complete ◴[18 Jan 25 02:18 UTC] No.42745222[source]▶

>>42743566 (TP) #

CPUs are surprisingly good at dealing with this in their store queues. I see this write-all-and-increment-some technique used a lot in optimized code, like branchless left-pack routines or overcopying in the copy handler of an LZ/Deflate decompressor.
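As a rough illustration of write-all-and-increment-some, here is a sketch of a scalar left-pack. The function name and the threshold predicate are made up for the example, and whether the compiler emits fully branch-free code (the slice bounds check aside) depends on the optimizer.

```rust
/// Pack the elements of `input` that are >= `threshold` to the front of
/// `out` and return how many were kept. Every element is stored
/// unconditionally; only the write cursor advances conditionally.
fn left_pack(input: &[u32], threshold: u32, out: &mut [u32]) -> usize {
    assert!(out.len() >= input.len());
    let mut n = 0;
    for &x in input {
        out[n] = x; // always store, even if x will be discarded
        n += (x >= threshold) as usize; // advance only when x is kept
    }
    n
}
```

A rejected element is simply overwritten by the next store to the same slot, so slots at or past the returned count may hold leftover values and should be ignored.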

replies(1): >>42747276 #

4. atq2119 ◴[18 Jan 25 10:19 UTC] No.42747276[source]▶

>>42745222 #

Yep, same with overlapping unaligned loads. It's just fairly cheap to make that stuff pipelined and run fast. It's only when you mix loads and stores in the same memory region that there are conflicts that can slow you down (and then quite horribly actually, depending on the exact processor).
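One common shape of the overlapping-access trick, sketched under assumptions: a copy of 8 to 16 bytes done as two 8-byte chunks, one anchored at the start and one at the end, so the chunks overlap whenever the length is under 16. The function name is hypothetical, and an optimizing compiler would typically, though not necessarily, lower each chunk to a single unaligned 64-bit load or store.

```rust
/// Copy `src` (8 to 16 bytes long) into the front of `dst` using two
/// 8-byte chunks. Both loads happen before either store, so loads and
/// stores to the same region are not interleaved.
fn copy_8_to_16(src: &[u8], dst: &mut [u8]) {
    let n = src.len();
    assert!((8..=16).contains(&n) && dst.len() >= n);

    let mut head = [0u8; 8];
    let mut tail = [0u8; 8];
    head.copy_from_slice(&src[..8]); // first 8 bytes
    tail.copy_from_slice(&src[n - 8..]); // last 8 bytes, may overlap head

    dst[..8].copy_from_slice(&head);
    dst[n - 8..n].copy_from_slice(&tail);
}
```

No length-dependent loop or branch is needed: the overlapping region is just written twice with the same bytes.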

replies(1): >>42748834 #

5. Sesse__ ◴[18 Jan 25 14:59 UTC] No.42748834{3}[source]▶

>>42747276 #

The place where I see this really hurt is when Clang/LLVM gets too fancy, in situations like this:

  - Function A calls function B, which returns some struct S (for instance on the stack).
  - B writes S by individual (small) stores.
  - A wants to copy S from some place to another (e.g. store it in some other struct).
  - LLVM coalesces the individual loads/stores needed to copy S, into one or a series of large operations (e.g. 128-bit SSE2 loads+stores).
  - These large loads are issued while the small stores from B are still pending, and necessarily overlap them.

Boom, store-to-load forwarding failure, and a bad stall. E.g., the Zen series seems to be really bad at this (I've only tried up to Zen 3), but there are pretty much no out-of-order CPUs that handle this without some kind of penalty.
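The shape of code described above can be sketched roughly as follows. All names here (`S`, `Holder`, `fill_b`, `caller_a`) are hypothetical, and whether a particular compiler version actually produces narrow stores followed by a wide copy depends on flags and target, so treat this as an illustration of the pattern rather than a guaranteed reproducer.

```rust
/// Hypothetical 32-byte struct standing in for S.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct S {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
}

/// Hypothetical destination that embeds an S.
#[derive(Debug, Default)]
struct Holder {
    tag: u64,
    s: S,
}

/// "Function B": writes S field by field, i.e. as individual 64-bit
/// stores unless the compiler merges them. `inline(never)` keeps these
/// stores from being fused with the caller's copy.
#[inline(never)]
fn fill_b(out: &mut S, x: u64) {
    out.a = x;
    out.b = x.rotate_left(7);
    out.c = !x;
    out.d = x.wrapping_mul(3);
}

/// "Function A": gets S from B on the stack, then copies it elsewhere.
fn caller_a(holder: &mut Holder, x: u64) {
    let mut tmp = S::default();
    fill_b(&mut tmp, x);
    // This 32-byte copy may be lowered to wide vector loads and stores.
    // If so, each wide load spans two of B's narrower stores, which may
    // still be sitting in the store queue.
    holder.s = tmp;
    holder.tag = x;
}
```

If the copy were done with loads of the same width as B's stores, each load could be forwarded from a single pending store. The wide load has no single store to forward from.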

replies(1): >>42755990 #

6. ack_complete ◴[19 Jan 25 11:05 UTC] No.42755990{4}[source]▶

>>42748834 #

This happens with partial autovectorization, too. The compiler fails to vectorize the first loop and then vectorizes the second; the result is a store-forwarding failure at the start of the second loop as it tries to read the output of the first, erasing the vectorization gains.
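A sketch of the kind of loop pair where this can occur, with hypothetical names and no guarantee about what any given compiler does with it. The first loop carries a dependency through `acc`, so it tends to stay scalar and emit one byte store per iteration. The second loop is a plain element-wise map that an optimizer may vectorize into wide loads of `tmp`.

```rust
/// Two back-to-back loops over the same temporary buffer.
fn two_loops(input: &[u8], tmp: &mut [u8], out: &mut [u8]) {
    assert!(tmp.len() == input.len() && out.len() == input.len());

    // Loop 1: loop-carried dependency on `acc`, likely left scalar,
    // writing `tmp` one byte at a time.
    let mut acc = 0u8;
    for (t, &x) in tmp.iter_mut().zip(input) {
        acc = acc.wrapping_mul(31).wrapping_add(x);
        *t = acc;
    }

    // Loop 2: independent per-element work, a candidate for
    // vectorization. Its first wide loads would read bytes that loop 1
    // stored individually just before.
    for (o, &t) in out.iter_mut().zip(tmp.iter()) {
        *o = t ^ 0x5a;
    }
}
```

For short buffers the stall at the start of the second loop can outweigh what the vectorized body saves.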