

366 points pabs3 | 19 comments
1. jokoon ◴[] No.41367403[source]
Question for somebody who doesn't work in chips: what does a software engineer have to do differently when targeting software for RISC-V?

I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?

I would imagine that some types of software are better suited for either CISC or RISC, like games, webservers?

replies(3): >>41367499 #>>41368262 #>>41370208 #
2. dzaima ◴[] No.41367499[source]
RISC-V with the compressed instruction extension actually ends up smaller than x86-64 and ARM on average.

There's not much inherent that needs to change in software approach. Probably the biggest thing vs x86-64 is the availability of 32 registers (vs 16 on x86-64), allowing for more intermediate values before things start spilling to stack, which also applies to ARM (which too has 32 registers). But generally it doesn't matter unless you're micro-optimizing.

More micro-optimization things might include:

- The vector extension (aka V or RVV) isn't in the base rv64gc ISA, so you might not get SIMD optimizations depending on the target; whereas x86-64 and aarch64 have SSE2 and NEON (128-bit SIMD) in their base.

- Similarly, no popcount & count leading/trailing zeroes in base rv64gc (requires Zbb); base x86-64 doesn't have popcount, but does have clz/ctz. aarch64 has all.

- Less efficient branchless select, i.e. "a ? b : c"; takes ~4-5 instrs on base rv64gc, 3 with Zicond, but 1 on x86-64 and aarch64. Some hardware can also fuse a jump over a mv instruction to be effectively branchless, but that's even more target-specific.

RISC-V profiles kind of solve the first two issues (e.g. Android requires rva23, which requires rvv & Zbb & Zicond among other things), but if linux distros decide to target rva20/rv64gc then they're ~forever stuck without having those extensions in precompiled code that hasn't bothered with dynamic dispatch. Though this is a problem with x86-64 too (much less so with ARM, which doesn't have that many extensions; SVE is probably the biggest one by far, and it still isn't widely supported, e.g. Apple silicon doesn't).

replies(1): >>41367683 #
3. packetlost ◴[] No.41367683[source]
That seems like something the compiler would generally handle, no? Obviously that doesn't apply everywhere, but in the general case it should.
replies(2): >>41367742 #>>41368271 #
4. dzaima ◴[] No.41367742{3}[source]
It's something that the compiler would handle, but can still moderately influence programming decisions, i.e. you can have a lot more temporary variables before things start slowing down due to spill stores/loads (esp. in, say, a loop with function calls, as more registers also means more non-volatile registers (i.e. those that are guaranteed to not change across function calls)). But, yes, very limited impact even then.
replies(1): >>41368021 #
5. packetlost ◴[] No.41368021{4}[source]
It's certainly something I would take into consideration when making a (language) runtime, but probably not at all during all but the most performance-sensitive of applications. Certainly a difference, but far lower level than what most applications require.
replies(1): >>41368149 #
6. dzaima ◴[] No.41368149{5}[source]
Yep. Unfortunately I am one to be making language runtimes :)

It's just the potentially most significant thing I could come up with at first. Though perhaps RVV not being in rva20/rv64gc is more significant.

replies(1): >>41368734 #
7. cesarb ◴[] No.41368262[source]
> Question for somebody who doesn't work in chips: what does a software engineer have to do differently when targeting software for RISC-V?

Most of the time, nothing; code correctly written on higher-level languages like C should work the same. The biggest difference, the weaker memory model, is something you also have on most non-x86 architectures like ARM (and your code shouldn't be depending on having a strong memory model in the first place).

> I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?

For historical reasons, executable code density on x86 is not that good, so the executable size won't increase as much as you'd expect; both RISC-V with its compressed instructions extension and 32-bit ARM with its Thumb extensions are fairly compact (there was an early RISC-V paper which did that code size comparison, if you want to find out more).

> I would imagine that some types of software are better suited for either CISC or RISC, like games, webservers?

What matters most is not CISC vs RISC, but the presence and quality of things like vector instructions and cryptography extensions. Some kinds of software like video encoding and decoding heavily depend on vector instructions to have good performance, and things like full disk encryption or hashing can be helped by specialized instructions to accelerate specific algorithms like AES and SHA256.

8. vlovich123 ◴[] No.41368271{3}[source]
Vector stuff is typically hand coded with intrinsics or assembly. Autovectorization has mixed results because there’s no way to request the compiler to promise that it vectorized the code.

But for an emulator like this, box64 has to pick how to emulate vectorized instructions on RISC-V (e.g. slowly using scalars, or trying to reimplement them with native vector instructions). The challenge, of course, is that you typically don't get as good performance unless the emulator can actually rewrite the code on the fly: a 1:1 mapping is going to be suboptimal versus noticing patterns of high-level operations being performed and replacing a whole chunk of instructions at once with a more optimized implementation that accounts for implementation differences on the chip (e.g. you may have to emulate missing instructions, but a rewriter could skip emulation if there's an alternate way to accomplish the same high-level computation).

The biggest challenge for something like this from a performance perspective, of course, will be translating the GPU stuff efficiently to hit the native driver code, given that RISC-V is likely relying on OSS GPU drivers (and maybe Wine to add another translation layer if the game is Windows-only).

replies(4): >>41368434 #>>41368685 #>>41368702 #>>41370359 #
9. dzaima ◴[] No.41368434{4}[source]
On clang, you can actually request that it gives a warning on missed vectorization of a given loop with "#pragma clang loop vectorize(enable)": https://godbolt.org/z/sP7drPqMT (and you can even make it an error).

There's even "#pragma clang loop vectorize(assume_safety)" to tell it that pointer aliasing won't be an issue (gcc has a similar "#pragma GCC ivdep"), which should get rid of most odd reasons for missed vectorization.

10. packetlost ◴[] No.41368685{4}[source]
> Vector stuff is typically hand coded with intrinsics or assembly. Autovectorization has mixed results because there’s no way to request the compiler to promise that it vectorized the code.

Right, but most of the time those are architecture-specific, and RVV 1.0 is substantially different from, say, NEON or SSE2, so you need to change them anyways. You also typically use specialized registers for those, not the general-purpose registers. I'm not saying there isn't work to be done (especially for an application like this one, which is extremely performance-sensitive); I'm saying that most applications won't have these problems or be so sensitive that register spills matter much, if at all.

replies(1): >>41370815 #
11. tormeh ◴[] No.41368702{4}[source]
I'd assume it uses RADV, same as the Steam Deck. For most workloads that's faster than AMD's own driver. And yes, it uses Wine and DXVK. As far as the game is concerned it's running on a DirectX-capable x86 Windows machine. That's a lot of translation layers.
12. packetlost ◴[] No.41368734{6}[source]
Looks like an APL project? That's really cool!
13. Pet_Ant ◴[] No.41370208[source]
No, any ISA pretty much should be equally good for any type of workload. If you are doing assembly programming then it makes a difference but if you were doing something in Python or Unity it really isn’t going to matter.

This is more about being free of ARM's patents and getting a fresh start using the lessons learned.

14. fngjdflmdflg ◴[] No.41370359{4}[source]
I read somewhere that since floating point addition is not associative, the compiler will not autovectorize because the order might change.
replies(1): >>41370773 #
15. vlovich123 ◴[] No.41370773{5}[source]
It’s somewhat more complicated than that (& presumed your hot path is floating point instead of integral), but that can be a consideration.
replies(1): >>41372962 #
16. vlovich123 ◴[] No.41370815{5}[source]
I’m highlighting that the compiler doesn’t take care of vector code quite as automatically, or as well, as it does register allocation and instruction selection, which are slightly more solved problems. And it’s easy to imagine that a compiler will fail to optimize a piece of code as well on something that’s architecturally quite novel. RISC-V and ARM aren’t actually so dissimilar at a high level that completely different optimizations need to be written (or even selectively weighted by architecture), but I imagine something like a Mill CPU might require quite a reimagining to get anything approaching optimal performance.
17. fngjdflmdflg ◴[] No.41372962{6}[source]
What are the other considerations? (assuming we are dealing with FP)
replies(1): >>41374143 #
18. vlovich123 ◴[] No.41374143{7}[source]
Disclaimer: not an expert here so could be very very wrong. This is just my understanding so happy to be corrected.

Another would be that something like fused multiply-add has different (higher, if I recall correctly) precision, which violates IEEE 754 and thus prevents vectorization, since the default options are standard-compliant.

Another is that some math intrinsics are documented to populate errno, which would prevent using autovec in paths that call such an intrinsic.

There may be other nuances depending on float vs double.

Basically, most of the things that make up -ffast-math would, I believe, prevent autovectorization.

replies(1): >>41375002 #
19. dzaima ◴[] No.41375002{8}[source]
Fused multiply add applies equally to scalar and vectorized code (and C actually allows compilers to fuse them; there's -ffp-contract=off / the FP_CONTRACT pragma to turn that off); the compiler/autovectorizer can trivially just leave multiply & add as separate if so requested (slower than having them fused? perhaps. But no impact at all on scalar vs vector given that both have the same fma applicability).

For <math.h> errno, there's -fno-math-errno; indeed included in -ffast-math, but you don't need the entirety of that mess for this.

Loops with a float accumulator are, I believe, the only case where -ffast-math is actually required for autovectorizability (and even then, IIRC there are sub-flags such that you can get the associativity-assuming optimizations while still allowing NaN/inf).