Snapdragon X2 Elite ARM Laptop CPU

(www.qualcomm.com)

drewg123 ◴[24 Sep 25 23:14 UTC] No.45367075[source]▶

Does anybody know if the X2 supports the x86 Total store ordering (TSO) memory ordering model? That's how Apple silicon does such efficient emulation of x86. I'd think that would be even MORE important for a Windows ARM64 laptop where there is so much more legacy x86 software going back decades.

replies(2): >>45367272 #>>45367322 #

bri3d ◴[24 Sep 25 23:42 UTC] No.45367322[source]▶

>>45367075 #

Does anyone have benchmarks for Rosetta with TSO vs the Linux version with no-TSO? I guess it might be a bit challenging to achieve apples to apples, although you could run a test benchmark on OSX and then Asahi on the same hardware, I think?

I've always been curious about just how much Rosetta magic is the implementation and how much is TSO; Prism in Windows 24H2 is also no slouch. If the recompiler is decent at tracing data dependencies it might not have to fence that much on a lot of workloads even without hardware TSO.

replies(3): >>45368478 #>>45370784 #>>45374073 #

1. ack_complete ◴[25 Sep 25 02:01 UTC] No.45368478[source]▶

>>45367322 #

People who have worked on the Windows x64 emulator claim that TSO isn't as much of a deal as claimed; other factors, like enhanced hardware flag conversion support and function call optimizations, play a significant role too:

http://www.emulators.com/docs/abc_exit_xta.htm

replies(2): >>45369177 #>>45369773 #

2. bri3d ◴[25 Sep 25 04:06 UTC] No.45369177[source]▶

>>45368478 (TP) #

This is more like what I’d expect! This is a great article too, thank you, this is the kind of thing I come to HN for :)

3. neobrain ◴[25 Sep 25 06:14 UTC] No.45369773[source]▶

>>45368478 (TP) #

> People who have worked on the Windows x64 emulator claim that TSO isn't as much of a deal as claimed

This is a misinterpretation of what the author wrote! There is a real and significant performance impact in emulating x86 TSO semantics on non-TSO hardware. What the author argues is that enabling TSO process-wide (like macOS does with Rosetta) resolves this impact but it carries counteracting overhead in non-emulated code (such as the emulator itself or in ARM64EC).
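
To make the cost concrete, here is a minimal C++ sketch (an illustration only, not taken from the article or from any emulator's source). Under x86 TSO, plain stores behave like release stores and plain loads like acquire loads, so a translator targeting weakly ordered ARM64 without hardware TSO has to assume any guest access might be part of a pattern like the one below. The helpers `guest_store` and `guest_load` are hypothetical names standing in for how such a translator might conservatively lower guest memory accesses.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Guest memory, modeled as atomics so the C++ memory model applies.
std::atomic<int> payload{0};
std::atomic<int> ready{0};

// Hypothetical lowering of a plain x86 store when the host has no hardware
// TSO: emit it as a release store so earlier accesses stay ordered before it.
inline void guest_store(std::atomic<int>& loc, int value) {
    loc.store(value, std::memory_order_release);
}

// Hypothetical lowering of a plain x86 load: emit it as an acquire load so
// later accesses stay ordered after it.
inline int guest_load(const std::atomic<int>& loc) {
    return loc.load(std::memory_order_acquire);
}

// Message-passing litmus test: on x86 the consumer can never observe
// ready == 1 together with a stale payload, and the lowering above
// preserves that guarantee.
int run_message_passing() {
    payload.store(0, std::memory_order_relaxed);
    ready.store(0, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        guest_store(payload, 42);  // guest code: plain store to payload
        guest_store(ready, 1);     // guest code: plain store to ready
    });
    std::thread consumer([&seen] {
        while (guest_load(ready) == 0) {
            std::this_thread::yield();  // spin until the flag is published
        }
        seen = guest_load(payload);
    });

    producer.join();
    consumer.join();
    assert(guest_load(ready) == 1);  // the flag was published before join
    return seen;
}
```

Calling `run_message_passing()` returns 42 every time. If both helpers used `std::memory_order_relaxed` instead, a weakly ordered host would be allowed to return 0, which is the behavior x86 code never expects. The ordered forms are what cost performance when applied to every access, including the many that never needed it.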

The claimed conclusion is that it's better to optimize TSO emulation itself rather than bruteforce it on the hardware level. The way Microsoft achieved this is by having their compiler generate metadata about code that requires TSO and by using ARM64EC, which forwards any API calls to x86 system libraries to native ARM64 builds of the same libraries. Note how the latter in particular will shift the balance in favor of software-based TSO emulation since a hardware-based feature would slow down the native system libraries.
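
As a rough picture of why compiler-provided metadata helps, here is a small C++ sketch. Everything in it is hypothetical: the `TranslationHints` structure, the `choose_lowering` function, and the idea of keying hints by instruction address are assumptions made for illustration, and do not describe the actual volatile metadata format or how Prism consumes it.

```cpp
#include <cstdint>
#include <set>

// How a translator might emit one guest memory access on the host.
enum class Lowering {
    Plain,    // ordinary host load/store, no ordering cost
    Ordered,  // acquire/release or fenced access that preserves TSO behavior
};

// Hypothetical per-module hints supplied by the compiler.
struct TranslationHints {
    bool has_metadata;                         // false for binaries built without hints
    std::set<std::uint64_t> ordered_accesses;  // guest instruction addresses flagged as ordering-sensitive
};

// Decide how to lower the guest access at insn_addr.
Lowering choose_lowering(const TranslationHints& hints, std::uint64_t insn_addr) {
    if (!hints.has_metadata) {
        // No information at all: stay conservative and pay the ordering cost.
        return Lowering::Ordered;
    }
    // With hints, only the flagged accesses need the expensive form.
    return hints.ordered_accesses.count(insn_addr) != 0 ? Lowering::Ordered
                                                        : Lowering::Plain;
}
```

The point of the sketch is the asymmetry: a binary without hints falls back to the conservative path (or to heuristics), while a binary with hints pays for ordering only where the compiler says it matters.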

Without ecosystem control, this isn't feasible to implement in other x86 emulators. We have a library forwarding feature in FEX, but adding libraries is much more involved (and hence currently limited to OpenGL and Vulkan). We're also working on detecting code that needs TSO using heuristics, but even that will only ever get us so far. FEX is mainly used for gaming though, where we have a ton of x86 code that may require TSO (e.g. mono/Unity) but wouldn't be handled by ARM64EC, so the balance may be in favor of hardware TSO either way here.

For reference, this is the paragraph (I think) you were referring to:

> Another common misconception about Rosetta is that it is fast because the hardware enforces Intel memory ordering, something called Total Store Ordering. I will make the argument that TSO is the last thing you want, since I know from experience the emulator has to access its own private memory and none of those memory accesses needs to be ordered. In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.

replies(1): >>45373247 #

4. ack_complete ◴[25 Sep 25 14:46 UTC] No.45373247[source]▶

>>45369773 #

How is it a misinterpretation? To re-quote that last sentence:

> In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.

That's the author directly saying that TSO isn't the major emulation performance gain that people think it is. You're correct that there are countering effects between TSO's benefits to the emulated code vs. the negative effects on the emulator and other non-emulated code in the same process that are fine running non-TSO, but to users, this distinction doesn't matter. All that matters is the performance of the emulated program as a whole.

As for the volatile metadata, you're correct that MSVC inserts additional data to aid the emulation. What's not so great is that:

- It was basically an almost undocumented, silent addition to MSVC.

- In some cases, it will slow down the generated x64 code slightly by adding NOPs where necessary to disambiguate the volatile access metadata.

- It only affects code statically compiled with a recent version of MSVC (late VS2019 or later). It doesn't help executables compiled with non-MSVC compilers like Clang, nor any JIT code, nor is there any documentation indicating how to support either of these cases.

replies(1): >>45376672 #

5. neobrain ◴[25 Sep 25 18:14 UTC] No.45376672{3}[source]▶

>>45373247 #

> How is it a misinterpretation? To re-quote that last sentence:

I think we agree in our understanding, but condensing it down to "TSO isn't as much of a deal as claimed" is misleading:

* Efficient TSO emulation is crucial (both on Windows and elsewhere)

* The blog claims hardware TSO is non-ideal on Windows only (because Microsoft adapted the ecosystem to facilitate software-based TSO emulation). (Even then, it's unclear if the author quantified the concrete impact)

* Hardware TSO is still of tremendous value on systems that don't have ecosystem support

> [volatile metadata] doesn't help executables compiled with non-MSVC compilers like Clang, nor any JIT code, nor is there any documentation indicating how to support either of these cases.

That's funny, I hadn't considered third party compilers. Those applications would still benefit from ARM64EC (i.e. native system libraries), but the actual application code would be affected quite badly by the TSO impact then, depending on how good their fallback heuristics are. (Same for older titles that were compiled before volatile metadata was added)

replies(2): >>45378301 #>>45381571 #

6. ack_complete ◴[25 Sep 25 20:10 UTC] No.45378301{4}[source]▶

>>45376672 #

> Efficient TSO emulation is crucial (both on Windows and elsewhere)

Yes, but this is not in contention...? No one is disputing that TSO semantics in the emulated x86 code need to be preserved and that it needs to be done fast, we're talking about the tradeoffs of also having TSO support on the host platform.

> The blog claims hardware TSO is non-ideal on Windows only (because Microsoft adapted the ecosystem to facilitate software-based TSO emulation). (Even then, it's unclear if the author quantified the concrete impact)

> Hardware TSO is still of tremendous value on systems that don't have ecosystem support

That isn't what the author said. From the article:

That is a direct statement on Rosetta/macOS and does not mention Prism/Windows. How correct that assessment may be is another matter, but it is not talking about Windows only.

> Those applications would still benefit from ARM64EC (i.e. native system libraries), but the actual application code would be affected quite badly by the TSO impact then, depending on how good their fallback heuristics are.

I will have to check this, but I don't think it's that bad. JITted programs run much, much better on my Snapdragon X device than on the older Snapdragon 835, but there are a lot of variables there (CPU much faster/wider, Windows 11 Prism vs. Windows 10 emulator, x86 vs x64 emulation). I have a program with native x64/ARM64 builds that runs about 25% slower in emulated x64 than in native ARM64; I'm curious myself to see how it runs with volatile metadata disabled.

7. ack_complete ◴[26 Sep 25 01:35 UTC] No.45381571{4}[source]▶

>>45376672 #

Following up on that last part -- I recompiled my x64 codebase with /volatileMetadata-, which reduced the volatile metadata by ~20K (the remainder most likely from the statically linked CRT). The difference in profiling results was negligible: it was under the noise level between the two builds, both of which ran about 15-30% below the native ARM64 build.

The interesting part is when the compatibility settings for the executables are modified to change the default multi-core setting from Fast to Strict Multi-Core Operation. In that mode, the build without volatile metadata runs about 20% slower than the default build. That indicates that the x64 emulator may be taking some liberties with memory ordering by default. Note that while this application is multithreaded, the worker threads do little and it is heavily bottlenecked on a single thread.

replies(1): >>45383126 #

8. neobrain ◴[26 Sep 25 05:51 UTC] No.45383126{5}[source]▶

>>45381571 #

20% is about the general order of magnitude we observed in FEX a while ago, though if you enable all TSO compatibility settings (including those rarely needed) it'll be even higher. As people elsewhere in the thread mentioned, it'd be interesting to see how FEX fares on Asahi with hardware TSO enabled vs disabled (but with conservative TSO emulation as set up by default), since it's less of a black box.
