
Why is Apple Rosetta 2 fast? (2022)

(dougallj.wordpress.com)
172 points fanf2 | 2 comments
Syonyk ◴[] No.42188705[source]
Post got the big one: Total Store Ordering (TSO).

The rest are all techniques in reasonably common use, but unless you have hardware support for x86's strong memory ordering, you cannot get very good x86-on-ARM performance. When inspecting existing code, it's by no means clear when strong memory ordering matters and when it doesn't, so you have to liberally sprinkle memory barriers around, which really kill performance.
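
To make the barrier cost described above concrete, here is a toy sketch in Python. It is not how Rosetta 2 or any real translator works. The function names (`translate`, `barrier_count`), the pseudo-instruction strings, and the barrier placement (a `dmb ishld` after every load, a `dmb ishst` before every store) are assumptions chosen for illustration. They represent one conservative way to keep x86's load/load, load/store, and store/store ordering on a weakly ordered core.

```python
# Toy model: translate a stream of x86 memory operations into ARM-like
# pseudo-instructions, with and without hardware TSO.
# Op names, register choice, and barrier placement are illustrative assumptions.

def translate(x86_ops, hardware_tso):
    """Return a list of ARM-like pseudo-instructions for the given x86 memory ops."""
    out = []
    for kind, addr in x86_ops:
        if kind == "load":
            out.append(f"ldr x0, [{addr}]")
            if not hardware_tso:
                # Keep later loads and stores from being reordered before this load.
                out.append("dmb ishld")
        elif kind == "store":
            if not hardware_tso:
                # Keep this store from being reordered before earlier stores.
                out.append("dmb ishst")
            out.append(f"str x0, [{addr}]")
        else:
            raise ValueError(f"unknown op kind: {kind}")
    return out


def barrier_count(instructions):
    """Count the barrier pseudo-instructions in a translated stream."""
    return sum(1 for ins in instructions if ins.startswith("dmb"))


# A message-passing pattern: the writer publishes data then a flag,
# the reader checks the flag then reads the data.
ops = [("store", "data"), ("store", "flag"), ("load", "flag"), ("load", "data")]

weak = translate(ops, hardware_tso=False)  # translator cannot tell which accesses matter
tso = translate(ops, hardware_tso=True)    # hardware already provides x86 ordering

print(barrier_count(weak), "barriers without hardware TSO")
print(barrier_count(tso), "barriers with hardware TSO")
```

Without ordering information, the translator in this sketch has to guard every memory access, so the number of barriers scales with the number of loads and stores. With hardware TSO the same stream needs none.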

The huge and fast L1I/L1D cache doesn't hurt things either... emulation tends to be cache-intensive.

replies(6): >>42188819 #>>42189266 #>>42189505 #>>42189556 #>>42189596 #>>42197760 #
jsheard ◴[] No.42188819[source]
It's surprising that (AFAIK) Qualcomm didn't implement TSO in the chips they made for the recent-ish Windows ARM machines. If anything, they need fast x86 emulation even more than Apple does, since Windows has a much longer tail of software support than macOS: there are going to be important Windows apps that stubbornly refuse to support native ARM basically forever.
replies(8): >>42188869 #>>42188881 #>>42188889 #>>42188901 #>>42189055 #>>42189531 #>>42189551 #>>42193997 #
scottlamb ◴[] No.42188869[source]
Does Windows's translation take advantage of those where they exist? E.g. if I launch an aarch64 Windows VM on my M2, does it use the M2's support for TSO when running x86_64 .exes or does it insert these memory barriers?

If not, it makes sense that Qualcomm didn't bother adding them.

replies(3): >>42188900 #>>42188924 #>>42189541 #
saagarjha ◴[] No.42189541[source]
No, because Windows is not aware of how Apple does it. There exist Linux patches documenting how to do so, though.
replies(1): >>42190069 #
1. scottlamb ◴[] No.42190069[source]
The article says the following:

> As far as I know this is not part of the ARM standard, but it also isn’t Apple specific: Nvidia Denver/Carmel and Fujitsu A64fx are other 64-bit ARM processors that also implement TSO (thanks to marcan for these details).

I'm not sure how to interpret that. Do these other processors have distinct/proprietary TSO extensions? Or is it referring to a single published (optional) extension that all three implement? The linked tweet has been deleted, so no clues there, and I stopped digging.

replies(1): >>42190273 #
2. saagarjha ◴[] No.42190273[source]
Those are just TSO all the time, I think, so they are stronger than the ARM requirement.