justahuman74 No.41364800
I hope they're able to get this ISA-level feedback to people at RVI
dmitrygr No.41364827
None of this is new. None of it.

In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISCV ISA is (#2 is lack of sane addressing modes).

Some of the better RISCV designs, in fact, implement a custom instr to do this, eg: BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....
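
For illustration (the field position and width here are arbitrary, and the exact lowering depends on the compiler), a minimal C sketch of the kind of extract in question:

    #include <stdint.h>

    /* Hypothetical field: 20 bits starting at bit 13 of a 64-bit word.
       On base RV64I the usual lowering is a pair of shifts, roughly:
           slli a0, a0, 31    # 64 - (13 + 20)
           srli a0, a0, 44    # 64 - 20
       An ISA with a bitfield-extract instruction (aarch64 UBFX, or a
       custom op like Hazard3's BEXTM) does this in one instruction. */
    static inline uint64_t extract_field(uint64_t x)
    {
        return (x >> 13) & 0xFFFFF;  /* mask too wide for a 12-bit andi immediate */
    }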

renox No.41364944
Whoa, someone else who doesn't believe that the RISC-V ISA is 'perfect'! I'm curious: how have the discussions on bitfield extract been going? Because it really does seem like an obvious oversight and something to add as a 'standard extension'.

What's your take on

1) unaligned 32-bit instructions with the C extension?

2) lack of 'trap on overflow' for arithmetic instructions? MIPS had it...

dmitrygr No.41364991
1. aarch64 does this right. RISCV tries to be too many things at once, and predictably ends up sucking at everything. Fast big cores should just stick to fixed-size instrs for faster decode: you always know where instrs start, and every cacheline holds an integer number of instrs. Microcontroller cores can use compressed instrs, since code size matters there and wide parallel decode does not. Trying to have one arch cover it all is idiotic.

2. Nobody uses it on MIPS either, so it is likely of no use.
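
(A rough sketch of how overflow gets handled in software instead, assuming the GCC/Clang builtin; the extra branch is what a trap-on-overflow instruction would fold into the add.)

    #include <stdint.h>
    #include <stdlib.h>

    /* Checked signed addition without hardware trap-on-overflow:
       __builtin_add_overflow emits the add plus an overflow test; the
       abort() stands in for whatever overflow policy the language wants. */
    static int64_t checked_add(int64_t a, int64_t b)
    {
        int64_t r;
        if (__builtin_add_overflow(a, b, &r))
            abort();
        return r;
    }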

loup-vaillant No.41365968
> Fast big cores should just stick to fixed size instrs for faster decode.

How much faster, though? RISC-V decode is not crazy like x86: you only need to look at the first byte to know how long the instruction is (the first two bits if you limit yourself to 16- and 32-bit instructions, 5 bits if you support 48-bit instructions, 6 bits if you support 64-bit instructions). Which means the serial part of the decoder is very small.
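
(A rough C sketch of that serial part, following the length-encoding convention in the RISC-V spec; note the 48/64-bit formats are reserved encodings rather than ratified instructions.)

    #include <stddef.h>
    #include <stdint.h>

    /* Instruction length in bytes, determined from the low bits of the
       first 16-bit parcel. */
    static size_t rv_instr_len(uint16_t parcel)
    {
        if ((parcel & 0x03) != 0x03) return 2;  /* bits [1:0] != 11      -> 16-bit */
        if ((parcel & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111     -> 32-bit */
        if ((parcel & 0x20) == 0)    return 6;  /* bits [5:0] == 011111  -> 48-bit */
        if ((parcel & 0x40) == 0)    return 8;  /* bits [6:0] == 0111111 -> 64-bit */
        return 0;                               /* longer/reserved formats */
    }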

The bigger complaint about variable-length instructions is potentially misaligned instructions, which do not play well with cache lines (a single instruction may start in one cache line and end in the next, making the hardware a bit more hairy).

And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.

Thus, it’s not clear to me that fixed-size instructions are the obvious way to go for big cores.

newpavlov No.41366051
Another argument against the C extension is that it uses a big chunk of the opcode space, which may be better used for other extensions with 32-bit instructions.
camel-cdr No.41366354
Is having just 32-bit and naturally aligned 64-bit instructions a better path than having fewer 32-bit encodings, but 16/48/64-bit instructions as well?

I think it's quite unclear which one is better. 48-bit instructions have a lot of potential imo: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit ones (2/3 to 3/4 of 43 bits of encoding).

There are essentially two design philosophies:

1. 32-bit instructions, plus naturally aligned 64-bit instructions

2. 16/32/48/64-bit instructions with 16-bit alignment

Implementation complexity is debatable, although it seems to somewhat favor option 1:

1: you need to crack instructions into uops, because your 32-bit instructions need to do more complex things

2: you need to find instruction starts, and handle decoding instructions that span across a cache line

How big the impact is relative to the entire design is quite unclear.

Finding instruction starts means you need to propagate a few bits over your entire decode width, but cracking also requires something similar. Consider that if you can handle 8 uops, then those can come from the first 4 instructions being cracked into 2 uops each, or from 8 instructions that don't need to be cracked, and everything in between. With cracking, you have more freedom about where in the pipeline you do it, but you still have to be able to handle it.
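
(To make "propagate a few bits over your entire decode width" concrete, a toy sketch of the option-2 side, reusing the rv_instr_len() helper sketched further up the thread; in hardware this serial scan is what has to be parallelized or speculated across the fetch group.)

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Mark which 16-bit parcels of an 8-parcel (16-byte) fetch group start
       an instruction, assuming the group begins on an instruction boundary.
       Each start position depends on the lengths of all earlier
       instructions -- that is the serial dependency. */
    static void mark_starts(const uint16_t parcels[8], bool start[8])
    {
        memset(start, 0, 8 * sizeof start[0]);
        size_t i = 0;
        while (i < 8) {
            start[i] = true;
            size_t len = rv_instr_len(parcels[i]);  /* length in bytes */
            if (len == 0)
                break;          /* reserved encoding: stop scanning */
            i += len / 2;       /* advance by parcels; may run past the group
                                   (an instruction split across fetch groups) */
        }
    }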

In the end, both need to decode across cache lines for performance, but only option 2 has to deal with an instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.

If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.

newpavlov No.41367371
There is also a middle ground of requiring 16/48-bit sequences to be padded with a 16-bit NOP so they align to 32 bits. I agree that at this time it's not clear whether the C extension is a good idea or not (same with the V extension).
sweetjuly No.41368450
The C extension authors did consider requiring alignment/padding to prevent the misaligned 32-bit instruction issues, but they specifically mention rejecting it since it ate up all the code size savings.
Dylan16807 No.41369435
Did they specifically analyze doing alignment on a cache line basis?
adgjlsfhk1 No.41370170
That seems really tough for compilers.
dmitrygr No.41370228
Not really. Most modern x86 compilers already align jump targets to cache line boundaries since this helps x86 a lot, so it is doable. If you compile each function into its own section (common), then the linker can be told to align them to 64 or 128 bytes easily. Code size would grow (but Tetris can be played to reduce this by packing functions).
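
(A rough sketch of what that looks like in practice, assuming GCC/Clang; 64 here is just an assumed cache line size.)

    /* Per-function: pin a hot function, i.e. its entry jump target, to a
       64-byte boundary. For a whole program, -falign-functions=64 or
       -ffunction-sections plus linker-directed section alignment achieve
       a similar effect. */
    __attribute__((aligned(64)))
    void hot_entry(void)
    {
        /* ... */
    }
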
sweetjuly No.41374388
This would require specifying a cache line size in the ABI, which is a somewhat odd uarch detail to bubble up. While 64 bytes has been conventional for large application processors for a long time, I wouldn't want to make it a requirement.
Dylan16807 No.41375502
It's definitely worth analyzing though.

See how big of a block you need to get 90% of the compression benefit, etc.