I've always been curious about just how much of Rosetta's magic is the implementation and how much is TSO; Prism in Windows 24H2 is also no slouch. If the recompiler is decent at tracing data dependencies, it might not have to fence that much on a lot of workloads even without hardware TSO.
This is a misinterpretation of what the author wrote! There is a real and significant performance impact in emulating x86 TSO semantics on non-TSO hardware. What the author argues is that enabling TSO process-wide (like macOS does with Rosetta) resolves this impact but it carries counteracting overhead in non-emulated code (such as the emulator itself or in ARM64EC).
The claimed conclusion is that it's better to optimize TSO emulation itself rather than brute-force it at the hardware level. The way Microsoft achieved this is by having their compiler generate metadata about code that requires TSO and by using ARM64EC, which forwards any API calls to x86 system libraries to native ARM64 builds of the same libraries. Note how the latter in particular will shift the balance in favor of software-based TSO emulation, since a hardware-based feature would slow down the native system libraries.
Without ecosystem control, this isn't feasible to implement in other x86 emulators. We have a library forwarding feature in FEX, but adding libraries is much more involved (and hence currently limited to OpenGL and Vulkan). We're also working on detecting code that needs TSO using heuristics, but even that will only ever get us so far. FEX is mainly used for gaming though, where we have a ton of x86 code that may require TSO (e.g. mono/Unity) but wouldn't be handled by ARM64EC, so the balance may be in favor of hardware TSO either way here.
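To make the trade-off concrete, here is a toy sketch of the decision a translator faces for each x86 memory access. It is not FEX's or Prism's actual logic: `MemOp`, `lower`, `count_ordered`, and the "stack-relative means thread-private" rule are all hypothetical names and assumptions picked for illustration, and that rule is unsound whenever a stack address escapes to another thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str      # "load" or "store"
    base_reg: str  # x86 base register of the address, e.g. "rsp" or "rax"


# Assumption for illustration only: rsp-relative accesses are thread-private.
PRIVATE_BASES = {"rsp"}


def lower(op: MemOp, hardware_tso: bool, use_heuristic: bool) -> str:
    """Pick an ARM64 mnemonic for one x86 access in this toy model."""
    plain = "ldr" if op.kind == "load" else "str"
    # ldapr (Armv8.3 RCpc) and stlr give acquire/release ordering in software.
    ordered = "ldapr" if op.kind == "load" else "stlr"
    if hardware_tso:
        return plain  # the CPU already enforces x86-style ordering
    if use_heuristic and op.base_reg in PRIVATE_BASES:
        return plain  # guessed to be private, so no ordering is emitted
    return ordered


def count_ordered(ops, hardware_tso: bool, use_heuristic: bool) -> int:
    """Count how many accesses end up as the more expensive ordered form."""
    return sum(lower(op, hardware_tso, use_heuristic) in ("ldapr", "stlr")
               for op in ops)
```

With hardware TSO every access lowers to a plain `ldr`/`str`; without it, the heuristic only saves the accesses it can prove (or guess) are private, which is why heuristics alone only get you so far.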
For reference, this is the paragraph (I think) you were referring to:
> Another common misconception about Rosetta is that it is fast because the hardware enforces Intel memory ordering, something called Total Store Ordering. I will make the argument that TSO is the last thing you want, since I know from experience the emulator has to access its own private memory and none of those memory accesses needs to be ordered. In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.
> In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.
That's the author directly saying that TSO isn't the major emulation performance gain that people think it is. You're correct that there are countering effects between TSO's benefits to the emulated code vs. the negative effects on the emulator and other non-emulated code in the same process that are fine running non-TSO, but to users, this distinction doesn't matter. All that matters is the performance of the emulated program as a whole.
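A back-of-the-envelope way to look at "the program as a whole" is a toy cost model like the one below. It is only a sketch: `total_time` and its parameters are hypothetical, and the overhead values you plug in are placeholders rather than measurements from Rosetta, Prism, or FEX.

```python
def total_time(emulated: float, native: float,
               fence_overhead: float, tso_penalty: float,
               hardware_tso: bool) -> float:
    """Toy model of process run time under the two strategies.

    emulated / native: baseline time spent in translated x86 code and in
    native code (emulator internals, ARM64EC-forwarded libraries).
    fence_overhead: fractional slowdown of translated code when ordering
    is enforced in software.
    tso_penalty: fractional slowdown of native code when the whole
    process runs in hardware TSO mode.
    """
    if hardware_tso:
        # Translated code needs no extra ordering, native code pays instead.
        return emulated + native * (1 + tso_penalty)
    # Translated code pays for ordering, native code runs at full speed.
    return emulated * (1 + fence_overhead) + native
```

In this model the winner depends on how the time splits between translated and native code: the more work that gets forwarded to native libraries, the more the process-wide penalty matters, and vice versa for a process that spends nearly all its time in translated code.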
As for the volatile metadata, you're correct that MSVC inserts additional data to aid the emulation. What's not so great is that:
- It was basically an almost undocumented, silent addition to MSVC.
- In some cases, it will slow down the generated x64 code slightly by adding NOPs where necessary to disambiguate the volatile access metadata.
- It only affects code statically compiled with a recent version of MSVC (late VS2019 or later). It doesn't help executables compiled with non-MSVC compilers like Clang, nor any JIT code, nor is there any documentation indicating how to support either of these cases.
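As a rough mental model of how such metadata could be consumed, here is a sketch in which the compiler's output is reduced to a set of instruction addresses that need ordering. Nothing here reflects the real on-disk format (which, as noted, is barely documented); `lower_with_metadata`, the `flagged` set, and the conservative fallback when no metadata exists are all assumptions for illustration.

```python
def lower_with_metadata(addr: int, kind: str,
                        flagged: set, have_metadata: bool) -> str:
    """Pick an ARM64 mnemonic for the x64 access at `addr` (toy model).

    flagged: hypothetical set of instruction addresses the compiler
    marked as needing x86 ordering (e.g. volatile accesses).
    have_metadata: False for binaries built without the metadata, such
    as other compilers or JIT-generated code.
    """
    plain = "ldr" if kind == "load" else "str"
    ordered = "ldapr" if kind == "load" else "stlr"
    if not have_metadata:
        # Assumed fallback: stay conservative and order everything.
        return ordered
    # With metadata, only the flagged accesses pay for ordering.
    return ordered if addr in flagged else plain
```

The last bullet above is the `have_metadata=False` path: code from Clang or a JIT gets no benefit unless the emulator has some other way to relax ordering.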
I think we agree in our understanding, but condensing it down to "TSO isn't as much of a deal as claimed" is misleading:
* Efficient TSO emulation is crucial (both on Windows and elsewhere)
* The blog claims hardware TSO is non-ideal on Windows only (because Microsoft adapted the ecosystem to facilitate software-based TSO emulation). (Even then, it's unclear if the author quantified the concrete impact)
* Hardware TSO is still of tremendous value on systems that don't have ecosystem support
> [volatile metadata] doesn't help executables compiled with non-MSVC compilers like Clang, nor any JIT code, nor is there any documentation indicating how to support either of these cases.
That's funny, I hadn't considered third party compilers. Those applications would still benefit from ARM64EC (i.e. native system libraries), but the actual application code would be affected quite badly by the TSO impact then, depending on how good their fallback heuristics are. (Same for older titles that were compiled before volatile metadata was added)
Yes, but this is not in contention...? No one is disputing that TSO semantics in the emulated x86 code need to be preserved and that it needs to be done fast; we're talking about the tradeoffs of also having TSO support on the host platform.
> The blog claims hardware TSO is non-ideal on Windows only (because Microsoft adapted the ecosystem to facilitate software-based TSO emulation). (Even then, it's unclear if the author quantified the concrete impact)
> Hardware TSO is still of tremendous value on systems that don't have ecosystem support
That isn't what the author said. From the article:
> Another common misconception about Rosetta is that it is fast because the hardware enforces Intel memory ordering, something called Total Store Ordering. I will make the argument that TSO is the last thing you want, since I know from experience the emulator has to access its own private memory and none of those memory accesses needs to be ordered. In my opinion, TSO is a red herring that isn't really improving performance, but it sounds nice on paper.
That is a direct statement on Rosetta/macOS and does not mention Prism/Windows. How correct that assessment may be is another matter, but it is not talking about Windows only.
> Those applications would still benefit from ARM64EC (i.e. native system libraries), but the actual application code would be affected quite badly by the TSO impact then, depending on how good their fallback heuristics are.
I will have to check this, but I don't think it's that bad. JITted programs run much, much better on my Snapdragon X device than on the older Snapdragon 835, but there are a lot of variables there (CPU much faster/wider, Windows 11 Prism vs. Windows 10 emulator, x86 vs. x64 emulation). I have a program with native x64/ARM64 builds that runs about 25% slower in emulated x64 than in native ARM64, and I'm curious myself to see how it runs with volatile metadata disabled.
The interesting part is when the compatibility settings for the executables are modified to change the default multi-core setting from Fast to Strict Multi-Core Operation. In that mode, the build without volatile metadata runs about 20% slower than the default build. That indicates that the x64 emulator may be taking some liberties with memory ordering by default. Note that while this application is multithreaded, the worker threads do little and it is heavily bottlenecked on a single thread.