Box64 and RISC-V in 2024: What It Takes to Run the Witcher 3 on RISC-V

(box86.org)

366 points pabs3 | 4 comments | 27 Aug 24 04:23 UTC

Manfred ◴[27 Aug 24 08:31 UTC] No.41365540[source]▶

> At least in the context of x86 emulation, among all 3 architectures we support, RISC-V is the least expressive one.

RISC was explained to me as a reduced instruction set computer in computer science history classes, but I see a lot of articles and proposed new RISC-V profiles about "we just need a few more instructions to get feature parity".

I understand that RISC-V is just a convenient alternative to other platforms for most people, but does this also mean the RISC dream is dead?

replies(7): >>41365583 #>>41365644 #>>41365687 #>>41365974 #>>41366364 #>>41370373 #>>41370588 #

flanked-evergl ◴[27 Aug 24 08:41 UTC] No.41365583[source]▶

>>41365540 #

Is there a RISC dream? I think there is an efficiency "dream", there is a performance "dream", there is a cost "dream" — there are even low-complexity relative to cost, performance and efficiency "dreams" — but a RISC dream? Who cares more about RISC than cost, performance, efficiency and simplicity?

replies(2): >>41365643 #>>41366020 #

Joker_vD ◴[27 Aug 24 10:26 UTC] No.41366020[source]▶

>>41365583 #

There was such a dream. It was about taking a mind-bogglingly simple CPU, putting caches into the now-empty space where all the control logic used to be, clocking it up the wazoo, and letting the software deal with load/branch delays, efficiently using all 64 registers, etc. That would beat the hell out of those silly CISC architectures on performance, and at a fraction of the design and production costs!

This didn't work out, for two main reasons. First, just being able to turn clocks hella high is still not enough to get great performance: you really do want your CPU to be super-scalar, out-of-order, and with a great branch predictor, if you need amazing performance. But when you do all that, the simplicity of RISC decoding stops mattering all that much, as the Pentium II demonstrated when it equalled the DEC Alpha on performance, while still having practically useful things like byte loads/stores. Yes, it's RISC-like instructions under the hood, but that's an implementation detail, and there is no reason to expose it to the user in the ISA, just as you don't have to expose the branch delay slots in your ISA, because it's a bad idea to do so: e.g. MIPS II added 1 additional pipeline stage, and now they needed two branch/load delay slots. Whoops! So they added interlocks anyway (MIPS originally stood for "Microprocessor without Interlocked Pipelined Stages", ha-ha) and got rid of the load delays; they still left 1 branch delay slot exposed for backwards compatibility, and the circuitry required was arguably silly.

The second reason was that the software (or compilers, to be more precise) can't really deal very well with all that stuff from the first paragraph. That's what sank Itanium. That's why nobody makes CPUs with register windows any more. And static instruction scheduling in the compilers still can't beat dynamic instruction reordering.

replies(3): >>41366206 #>>41367836 #>>41368474 #

vlovich123 ◴[27 Aug 24 15:14 UTC] No.41368474[source]▶

>>41366020 #

To add to what the sibling said: setting aside that CISC chips have a separate frontend that breaks complex instructions down into an internal RISC-like instruction set, which blurs the difference, more RISC-like instruction sets do tend to win on performance and power, mainly because the instruction set has a fixed width. This means that with a line of cache and 4 byte instructions you could start decoding 32 instructions in parallel, whereas x86’s variable-length encoding makes it harder to keep the superscalar pipeline full (its decoder is significantly more complex in order to still extract parallelism, which slows it down further). This is a bit more complex on ARM (and maybe RISC-V?) where you have two widths, but even then in practice it’s easier to extract performance because an x86 instruction can be anywhere from 1-4 bytes (or 1-8? Can’t remember), which makes it hard to find instruction boundaries in parallel.
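The boundary-finding point above can be illustrated with a minimal Python sketch. It uses a made-up toy encoding (the `toy_length` rule is an assumption for illustration, not real x86 or ARM): with a fixed width every instruction start is known up front, while with a variable width each start depends on the length of the instruction before it.

```python
# Toy illustration only: the encoding below is hypothetical, not real x86 or ARM.

def fixed_width_boundaries(code: bytes, width: int = 4) -> list[int]:
    """With a fixed width, every instruction start is simply i * width."""
    return list(range(0, len(code), width))


def toy_length(code: bytes, offset: int) -> int:
    """Hypothetical rule: the low 2 bits of the first byte give a length of 1..4."""
    return (code[offset] & 0b11) + 1


def variable_width_boundaries(code: bytes) -> list[int]:
    """With a variable width, each start is only known after the previous
    instruction's length has been decoded, so the walk is inherently serial."""
    boundaries = []
    offset = 0
    while offset < len(code):
        boundaries.append(offset)
        offset += toy_length(code, offset)
    return boundaries
```

For a 128-byte buffer, `fixed_width_boundaries` returns 32 start offsets without looking at a single byte, whereas `variable_width_boundaries` has to inspect the bytes one instruction at a time.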

There’s a reason that Apple is whooping AMD and Intel on performance/watt and it’s not solely because they’re on a newer fab process (it’s also why AMD and Intel utterly failed to get mobile CPU variants of their chips off the ground).

replies(1): >>41368730 #

1. Joker_vD ◴[27 Aug 24 15:35 UTC] No.41368730[source]▶

>>41368474 #

x86 instruction lengths range from 1 to 15.

> a line of cache and 4 byte instructions you could start decoding 32 instructions in parallel

In practice, ARM processors decode up to 4 instructions in parallel; so do Intel and AMD.

replies(1): >>41369863 #

2. adgjlsfhk1 ◴[27 Aug 24 16:59 UTC] No.41369863[source]▶

>>41368730 #

Apple's M1 chips are 8-wide, and AMD's and Intel's newest chips are also doing fancier things than 4-wide decode.

replies(1): >>41370753 #

3. vlovich123 ◴[27 Aug 24 18:05 UTC] No.41370753[source]▶

>>41369863 #

Any reading resources? I’d love to learn more about the techniques they’re using to get better parallelism. The most obvious solution I can imagine is that they’d just brute-force it: start executing at every possible boundary and rely on it either decoding an invalid instruction or late-latching the result until it is confirmed to be a valid instruction boundary. Is that generally the technique, or are they doing more than even that? The challenge with this technique, of course, is that you risk wasting energy and execution units on phantom work, compared with an architecture that doesn’t have as much potential for phantom work in the first place.
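The brute-force idea guessed at above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the `toy_length` rule is a hypothetical encoding, and nothing here describes how any shipping x86 decoder actually works. Step 1 guesses a length at every byte offset (the part hardware could do in parallel); step 2 walks from a known-good start and keeps only the guesses that landed on real boundaries.

```python
# Toy model only: hypothetical encoding, not a real decoder design.

def toy_length(first_byte: int) -> int:
    """Hypothetical rule: the low 2 bits of the first byte give a length of 1..4."""
    return (first_byte & 0b11) + 1


def speculative_lengths(window: bytes) -> list[int]:
    """Step 1: treat every byte offset as a possible instruction start and
    compute the length an instruction starting there would have."""
    return [toy_length(b) for b in window]


def select_boundaries(lengths: list[int], start: int = 0) -> list[int]:
    """Step 2: follow the chain from a known start; only these offsets were
    real instruction boundaries, the other guesses are discarded."""
    boundaries = []
    offset = start
    while offset < len(lengths):
        boundaries.append(offset)
        offset += lengths[offset]
    return boundaries


def wasted_guesses(window: bytes) -> int:
    """Count the speculative decodes that turned out to be phantom work."""
    lengths = speculative_lengths(window)
    return len(window) - len(select_boundaries(lengths))
```

In this model `wasted_guesses` is the "phantom" cost the comment worries about: every offset that was decoded speculatively but did not turn out to be a real boundary.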

replies(1): >>41374980 #

4. adgjlsfhk1 ◴[28 Aug 24 00:57 UTC] No.41374980[source]▶

>>41370753 #

https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5... is a pretty good overview of the microarchitecture. I don't think they say how they get there, because trade secrets.
