In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISC-V ISA is (#2 is the lack of sane addressing modes).
Some of the better RISC-V designs do, in fact, implement a custom instruction for this, e.g. BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....
We figured out yesterday [1] that the example in the article can already be done in four RISC-V instructions; it's just a bit trickier to come up with:
    # a0 = rax, a1 = rbx
    slli t0, a1, 64-8     # move the source byte (BL) up to bits 63:56
    rori a0, a0, 16       # rotate rax so AH sits in the top byte
    add  a0, a0, t0       # byte-wide add; any carry out of bit 63 simply drops
    rori a0, a0, 64-16    # rotate back, leaving every other byte of rax intact
[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...

What's your take on
1) unaligned 32-bit instructions with the C extension?
2) the lack of 'trap on overflow' for arithmetic instructions? MIPS had it.
2. Nobody uses it on MIPS either, so it is likely of no use.
Flag computation and conditional jumps are where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for the flags and then computes them one by one. QEMU instead stores the original operands and computes the flags lazily. Both approaches have advantages and disadvantages...
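For illustration, a minimal sketch of the lazy scheme in C (state layout and names are my own invention, not QEMU's actual implementation): the emulated ADD only records its operands, and each flag is derived on demand:

    #include <stdint.h>
    #include <stdbool.h>

    /* Last ALU operation's operands and result; flags are derived from
       these only when a conditional branch actually needs them. */
    enum flag_op { FLAG_OP_ADD, FLAG_OP_SUB };

    struct lazy_flags {
        enum flag_op op;
        uint64_t src1, src2, result;
    };

    static bool get_zf(const struct lazy_flags *f) {
        return f->result == 0;             /* ZF: result is zero */
    }

    static bool get_sf(const struct lazy_flags *f) {
        return (int64_t)f->result < 0;     /* SF: sign bit of result */
    }

    static bool get_cf(const struct lazy_flags *f) {
        if (f->op == FLAG_OP_ADD)
            return f->result < f->src1;    /* unsigned carry out of an add */
        else
            return f->src1 < f->src2;      /* borrow of a subtract */
    }

    /* The emulated ADD itself does no flag math on the hot path. */
    static uint64_t emu_add(struct lazy_flags *f, uint64_t a, uint64_t b) {
        f->op = FLAG_OP_ADD;
        f->src1 = a;
        f->src2 = b;
        f->result = a + b;
        return f->result;
    }

The win is that straight-line code which never reads the flags (the common case, and what Box64's liveness pass detects statically) pays almost nothing for them.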
How much faster, though? RISC-V decode is not crazy like x86: you only need to look at the first byte to know how long the instruction is (the first two bits if you limit yourself to 16- and 32-bit instructions, 5 bits if you support 48-bit instructions, 6 bits if you support 64-bit instructions). Which means the serial part of the decoder is very, very small.
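As a sketch (this follows the standard RISC-V length encoding; the helper name is mine):

    #include <stdint.h>

    /* Length in bytes of a RISC-V instruction, from its first halfword,
       assuming nothing longer than 64-bit instructions is supported. */
    static int insn_length(uint16_t low) {
        if ((low & 0x03) != 0x03) return 2;  /* bits [1:0] != 11  -> 16-bit */
        if ((low & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111 -> 32-bit */
        if ((low & 0x20) == 0)    return 6;  /* bit 5 == 0        -> 48-bit */
        return 8;                            /* otherwise         -> 64-bit */
    }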
The bigger complaint about variable-length instructions is potentially misaligned instructions, which don't play well with cache lines (a single instruction may start in one cache line and end in the next, making the hardware a bit hairier).
And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.
Thus, it’s not clear to me that fixed-size instructions are the obvious way to go for big cores.
In the meantime, it can be done as two shifts: left to bring the field's MSB to the top of the register, and then right, filling with zero or sign bits. There is at least one core in development (SpaceMiT X100) that is supposed to be able to fuse those two into a single µop, and maybe some already do.
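For example, extracting an unsigned 8-bit field from bits 15:8 (the field position is chosen just for illustration):

    slli a0, a0, 48    # 64-16: the field's MSB (bit 15) moves up to bit 63
    srli a0, a0, 56    # 64-8: shift back down, zero-filling (srai to sign-extend)

That slli/srli pair is exactly the shape a fusing core would look for.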
However, I've also seen that one core (XiangShan Nanhu) fuses pairs of RVI instructions into single B-extension instructions, to be able to run old binaries compiled for CPUs without B faster. Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.
I think it's quite unclear which one is better. 48-bit instructions have a lot of potential IMO: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit ones (2/3 to 3/4 of 43 bits of encoding).
There are essentially two design philosophies:
1. 32-bit instructions, and naturally aligned 64-bit instructions
2. 16/32/48/64-bit instructions with 16-bit alignment
Implementation complexity is debatable, although it seems to somewhat favor option 1:
1: you need to crack instructions into µops, because your 32-bit instructions need to do more complex things
2: you need to find instruction starts, and handle decoding instructions that span two cache lines
How big the impact is relative to the entire design is quite unclear.
Finding instruction starts means you need to propagate a few bits across your entire decode width, but cracking requires something similar. Consider that if you can handle 8 µops, those can come from the first 4 instructions, each cracked into 2 µops, or from 8 instructions that don't need to be cracked, and everything in between. With cracking you have more freedom about where in the pipeline to do it, but you still have to be able to handle it.
In the end, both need to decode across cache lines for performance, but one of them also has to deal with a single instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.
If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.
One suggested solution has been to fill the gaps with NOPs, but then the compiler would have to track page alignment, which wouldn't work anyway if a system supports pages of varying sizes (ordinary vs. huge pages).
The best solution is perhaps to ignore compressed instructions when targeting high-performance cores and confine their usage to where they belong: power-efficient or low-performance microcontrollers.
If it's in the linker, then tracking pages sounds pretty doable.
You don't need to care about multiple page sizes. If you pad at the minimum page size, or even at 1 KB boundaries, that's a minuscule number of NOPs (at worst a couple of bytes per kilobyte of code, well under 1% overhead).
The possibility of I/O is in no way exclusive to compressed instructions. If the page-crossing instruction were padded, the second page would need to be faulted in anyway. All that matters is the number of pages needed for the piece of code, which is simply its code size.
The only case that actually has a chance of mattering is crossing cache lines.
And I would imagine high-performance cores would have some internal instruction buffer anyway, for doing cross-fetch-block instruction fusion and whatnot.
That's just one more instruction: right-align the AH, BH etc. source operand prior to exactly the same instruction sequence as above.
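For instance, assuming the four-instruction sequence above emulates an add into AH, the BH-sourced variant would look something like this (my sketch, not Box64's actual output):

    # add ah, bh: a0 = rax, a1 = rbx
    srli t0, a1, 8        # the extra instruction: right-align BH into bits 7:0
    slli t0, t0, 64-8     # from here on, identical to the sequence above
    rori a0, a0, 16
    add  a0, a0, t0
    rori a0, a0, 64-16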
And, yes, this being 64-bit code, compilers won't be generating such instructions. In fact, they started avoiding them as soon as out-of-order execution hit with the Pentium Pro, P II, P III etc. in the mid-90s, because of "partial register update stalls".
See how big of a block you need to get 90% of the compression benefit, etc.