
Why is Apple Rosetta 2 fast? (2022)

(dougallj.wordpress.com)
172 points fanf2 | 2 comments
Syonyk ◴[] No.42188705[source]
Post got the big one: Total Store Ordering (TSO).

The rest are all techniques in reasonably common use, but unless you have hardware support for x86's strong memory ordering, you cannot get very good x86-on-ARM performance. From inspecting existing code, it's by no means clear when strong memory ordering matters and when it doesn't, so you have to liberally sprinkle memory barriers around, which really kills performance.

The huge and fast L1I/L1D caches don't hurt things either... emulation tends to be cache-intensive.

replies(6): >>42188819 #>>42189266 #>>42189505 #>>42189556 #>>42189596 #>>42197760 #
vlovich123 ◴[] No.42189556[source]
Is TSO something other than doing atomics with seq_cst?
replies(2): >>42189999 #>>42194624 #
1. j16sdiz ◴[] No.42189999[source]
TSO is what x86 does when you are _not_ using atomics.
replies(1): >>42193237 #
2. adrian_b ◴[] No.42193237[source]
True, and it is a little more relaxed than sequential consistency.

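For simple loads and stores, the x86 CPUs do not reorder the loads between themselves or the stores between themselves. Also, stores are not performed before earlier loads. The one reordering that remains is a load completing before an earlier store to a different address, which is why TSO is slightly weaker than sequential consistency.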
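The difference from sequential consistency can be illustrated with the classic store-buffering litmus test. The following is only a sketch: the function name `count_both_zero_seq_cst` and the iteration count are made up for illustration. Each thread stores to one variable and then loads the other. With `seq_cst` accesses, both loads reading 0 is forbidden, so the count is always 0. With plain x86 stores and loads (or `memory_order_relaxed`), TSO allows each load to pass the earlier store, so both threads may read 0, although a harness this crude would rarely catch it because the two threads seldom overlap tightly.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test using sequentially consistent atomics.
// Returns how many of `iterations` runs ended with BOTH loads reading 0.
// Under seq_cst that outcome is forbidden, so the result is always 0.
int count_both_zero_seq_cst(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_seq_cst);   // store to x ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then load y
        });
        std::thread b([&] {
            y.store(1, std::memory_order_seq_cst);   // store to y ...
            r2 = x.load(std::memory_order_seq_cst);  // ... then load x
        });

        a.join();  // join() makes r1 and r2 safely visible here
        b.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // would mean both loads passed the earlier stores
        }
    }
    return both_zero;
}
```

On x86 the `seq_cst` store is what costs extra (compilers typically emit XCHG or MOV plus MFENCE for it), which is the price of closing the one reordering that TSO leaves open.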

Only some special kinds of stores can be reordered, i.e. those caused by string instructions or the stores of vector registers that are marked as NT (non-temporal).

So x86 does not need release stores; any simple store is suitable for this. Also, store barriers are not normally needed. Acquire fences, a.k.a. acquire barriers, are sometimes needed, but much less often than on CPUs with weaker ordering for the memory accesses (for acquire fences both x86 and Arm AArch64 have confusing mnemonics, i.e. LFENCE on x86 and DMB/DSB of the LD kind on AArch64; in both cases these instructions are not load fences as suggested by the mnemonics, but acquire fences).

When converting x86 code to AArch64 code, there are many cases where simple stores must be replaced with release stores (a.k.a. Store-Release instructions in the Arm documentation), and there are many places where acquire barriers must be inserted or, less frequently, store barriers must be inserted (for non-optimally written concurrent code it may also be necessary to replace some simple loads with the Load-Acquire instructions of AArch64).
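A minimal sketch of the message-passing pattern this is about, written with C++ atomics rather than raw instructions. The names `message_passing`, `payload` and `ready` are hypothetical. On x86 a compiler typically lowers both the release store and the acquire load to plain MOV, because TSO already provides the ordering. On AArch64 they typically become STLR and LDAR (or LDAPR where available). A binary translator sees only the x86 MOVs and cannot tell which of them play this role, so without hardware TSO it has to treat every store and load this way.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing: the producer writes data, then publishes it via a flag.
// Returns the payload value observed by the consumer (always 42 here).
int message_passing() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load: later reads cannot be reordered before it.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // guaranteed to observe the producer's write
    });

    payload = 42;
    // Release store: earlier writes cannot be reordered after it.
    ready.store(true, std::memory_order_release);

    consumer.join();
    return seen;
}
```

If both accesses to `ready` were weakened to `memory_order_relaxed`, the code would become a data race on `payload` in C++ terms, and on AArch64 the consumer could in principle observe the flag before the payload. Native x86 code gets away with plain stores and loads here, which is exactly the behavior an emulator has to preserve.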