In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISC-V ISA is (#2 is the lack of sane addressing modes).
Some of the better RISC-V designs do, in fact, implement a custom instruction for this, e.g. BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....
We figured out yesterday [1] that the example in the article can already be done in four RISC-V instructions; it's just a bit trickier to come up with:
    # a0 = rax, a1 = rbx
    slli t0, a1, 64-8     # move the source byte (BL) up to bits 63:56
    rori a0, a0, 16       # rotate rax so AH sits in the top byte
    add  a0, a0, t0       # byte-wide add; any carry out of bit 63 simply drops
    rori a0, a0, 64-16    # rotate back, leaving every other byte of rax intact
[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...

What's your take on
1) unaligned 32-bit instructions with the C extension?
2) the lack of 'trap on overflow' for arithmetic instructions? MIPS had it.
2. Nobody uses it on MIPS either, so it is likely of no use.
Flag computation and conditional jumps are where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for the flags and then computes them one by one. QEMU instead stores the original operands and computes the flags lazily. Both approaches have advantages and disadvantages...
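For illustration, a minimal sketch of the lazy scheme in C (state layout and names are my own invention, not QEMU's actual implementation): the emulated ADD only records its operands, and each flag is derived on demand:

    #include <stdint.h>
    #include <stdbool.h>

    /* Last ALU operation's operands and result; flags are derived from
       these only when a conditional branch actually needs them. */
    enum flag_op { FLAG_OP_ADD, FLAG_OP_SUB };

    struct lazy_flags {
        enum flag_op op;
        uint64_t src1, src2, result;
    };

    static bool get_zf(const struct lazy_flags *f) {
        return f->result == 0;             /* ZF: result is zero */
    }

    static bool get_sf(const struct lazy_flags *f) {
        return (int64_t)f->result < 0;     /* SF: sign bit of result */
    }

    static bool get_cf(const struct lazy_flags *f) {
        if (f->op == FLAG_OP_ADD)
            return f->result < f->src1;    /* unsigned carry out of an add */
        else
            return f->src1 < f->src2;      /* borrow of a subtract */
    }

    /* The emulated ADD itself does no flag math on the hot path. */
    static uint64_t emu_add(struct lazy_flags *f, uint64_t a, uint64_t b) {
        f->op = FLAG_OP_ADD;
        f->src1 = a;
        f->src2 = b;
        f->result = a + b;
        return f->result;
    }

The win is that straight-line code which never reads the flags (the common case, and what Box64's liveness pass detects statically) pays almost nothing for them.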
How much faster, though? RISC-V decode is not crazy like x86: you only need to look at the first byte to know how long the instruction is (the first two bits if you limit yourself to 16- and 32-bit instructions, 5 bits if you support 48-bit instructions, 6 bits if you support 64-bit instructions). Which means the serial part of the decoder is very, very small.
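As a sketch (this follows the standard RISC-V length encoding; the helper name is mine):

    #include <stdint.h>

    /* Length in bytes of a RISC-V instruction, from its first halfword,
       assuming nothing longer than 64-bit instructions is supported. */
    static int insn_length(uint16_t low) {
        if ((low & 0x03) != 0x03) return 2;  /* bits [1:0] != 11  -> 16-bit */
        if ((low & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111 -> 32-bit */
        if ((low & 0x20) == 0)    return 6;  /* bit 5 == 0        -> 48-bit */
        return 8;                            /* otherwise         -> 64-bit */
    }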
The bigger complaint about variable-length instructions is potentially misaligned instructions, which don't play well with cache lines (a single instruction may start in one cache line and end in the next, making the hardware a bit hairier).
And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.
Thus, it’s not clear to me that fixed-size instructions are the obvious way to go for big cores.
In the meantime, it can be done as two shifts: left to bring the field's MSB to the top of the register, and then right, filling with zero or sign bits. There is at least one core in development (SpaceMiT X100) that is supposed to be able to fuse those two into a single µop, and maybe some already do.
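For example, extracting an unsigned 8-bit field from bits 15:8 (the field position is chosen just for illustration):

    slli a0, a0, 48    # 64-16: the field's MSB (bit 15) moves up to bit 63
    srli a0, a0, 56    # 64-8: shift back down, zero-filling (srai to sign-extend)

That slli/srli pair is exactly the shape a fusing core would look for.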
However, I've also seen that one core (XiangShan Nanhu) fuses pairs of RVI instructions into single B-extension instructions, to be able to run old binaries compiled for CPUs without B faster. Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.
I think it's quite unclear which one is better. 48-bit instructions have a lot of potential IMO: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit ones (2/3 to 3/4 of 43 bits of encoding).
There are essentially two design philosophies:
1. 32-bit instructions, and naturally aligned 64-bit instructions
2. 16/32/48/64-bit instructions with 16-bit alignment
Implementation complexity is debatable, although it seems to somewhat favor option 1:
1: you need to crack instructions into µops, because your 32-bit instructions need to do more complex things
2: you need to find instruction starts, and handle decoding instructions that span two cache lines
How big the impact is relative to the entire design is quite unclear.
Finding instruction starts means you need to propagate a few bits across your entire decode width, but cracking requires something similar. Consider that if you can handle 8 µops, those can come from the first 4 instructions, each cracked into 2 µops, or from 8 instructions that don't need to be cracked, and everything in between. With cracking you have more freedom about where in the pipeline to do it, but you still have to be able to handle it.
In the end, both need to decode across cache lines for performance, but one of them also has to deal with a single instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.
If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.
One suggested solution has been to fill the gaps with NOPs, but then the compiler would have to track page alignment, which wouldn't work anyway if a system supports pages of varying sizes (ordinary vs. huge pages).
The best solution is perhaps to ignore compressed instructions when targeting high-performance cores and confine their usage to where they belong: power-efficient or low-performance microcontrollers.
If it's in the linker, then tracking pages sounds pretty doable.
You don't need to care about multiple page sizes. If you pad at the minimum page size, or even at 1 KB boundaries, that's a minuscule number of NOPs (at worst a couple of bytes per kilobyte of code, well under 1% overhead).
The possibility of I/O is in no way exclusive to compressed instructions. If the page-crossing instruction were padded, the second page would need to be faulted in anyway. All that matters is the number of pages needed for the piece of code, which is simply its code size.
The only case that actually has a chance of mattering is crossing cache lines.
And I would imagine high-performance cores would have some internal instruction buffer anyway, for doing cross-fetch-block instruction fusion and whatnot.
That's just one more instruction: right-align the AH, BH etc. source operand prior to exactly the same instruction sequence as above.
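For instance, assuming the four-instruction sequence above emulates an add into AH, the BH-sourced variant would look something like this (my sketch, not Box64's actual output):

    # add ah, bh: a0 = rax, a1 = rbx
    srli t0, a1, 8        # the extra instruction: right-align BH into bits 7:0
    slli t0, t0, 64-8     # from here on, identical to the sequence above
    rori a0, a0, 16
    add  a0, a0, t0
    rori a0, a0, 64-16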
And, yes, this being 64-bit code, compilers won't be generating such instructions. In fact, they started avoiding them as soon as out-of-order execution hit with the Pentium Pro, P II, P III etc. in the mid-90s, because of "partial register update stalls".
See how big of a block you need to get 90% of the compression benefit, etc.