In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISCV ISA is (#2 is lack of sane addressing modes).
Some of the better RISCV designs, in fact, implement a custom instruction to do this, e.g. BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....
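For reference, here is roughly what base RV32I forces you into instead; a minimal C sketch (the helper name is mine), where the compiler ends up emitting two dependent shifts where a single extract instruction would do:

    #include <stdint.h>

    /* Extract `width` bits starting at `lsb` from x. Without a bitfield-extract
       instruction this compiles to two dependent shifts (or a shift plus a mask)
       on base RV32I; a BEXTM-style instruction does it in one op.
       Assumes 1 <= width and lsb + width <= 32. */
    static inline uint32_t bitfield_extract(uint32_t x, unsigned lsb, unsigned width)
    {
        return (x << (32u - lsb - width)) >> (32u - width);
    }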
What's your take on
1) unaligned 32-bit instructions with the C extension?
2) the lack of 'trap on overflow' for arithmetic instructions? MIPS had it.
2. Nobody uses it on MIPS either, so it is likely of no use.
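For what it's worth, code that cares about signed overflow on ISAs without a trapping add just checks explicitly; a small C sketch using the GCC/Clang builtin (on RISC-V this costs a few extra ALU ops and a branch):

    #include <stdbool.h>
    #include <stdint.h>

    /* Checked signed add: returns false on overflow instead of trapping.
       On RISC-V the compiler emits the add plus a short compare-and-branch
       sequence, since there is no trapping add instruction. */
    static inline bool add_checked(int32_t a, int32_t b, int32_t *sum)
    {
        return !__builtin_add_overflow(a, b, sum);
    }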
How much faster, though? RISC-V decode is not crazy like x86: you only need to look at the first byte to know how long the instruction is (the first two bits if you limit yourself to 16- and 32-bit instructions, 5 bits if you support 48-bit instructions, 6 bits if you support 64-bit instructions). Which means the serial part of the decoder is very, very small.
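To make that concrete, here is a sketch of the length determination from the low bits of the first 16-bit parcel, following the length-encoding convention described in the unprivileged spec (the function name is mine):

    #include <stdint.h>

    /* Instruction length in bytes from the low bits of the first 16-bit parcel,
       per the RISC-V length-encoding convention. Returns 0 for the longer
       (>= 80-bit) reserved encodings. */
    static unsigned insn_length(uint16_t parcel)
    {
        if ((parcel & 0x03) != 0x03) return 2;  /* bits [1:0] != 11  -> 16-bit */
        if ((parcel & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111 -> 32-bit */
        if ((parcel & 0x20) == 0)    return 6;  /* bit 5 == 0        -> 48-bit */
        if ((parcel & 0x40) == 0)    return 8;  /* bit 6 == 0        -> 64-bit */
        return 0;
    }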
The bigger complaint about variable-length instructions is potentially misaligned instructions, which do not play well with cache lines (a single instruction may start in one cache line and end in the next, making the hardware a bit hairier).
And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.
Thus, it’s not clear to me that fixed-size instructions are the obvious way to go for big cores.
One suggested solution has been filling in the gaps with NOPs, but then the compiler would have to track page alignment, which would not work anyway if a system supports pages of varying sizes (ordinary vs. huge pages).
The best solution is perhaps to ignore compressed instructions when targeting high-performance cores and confine their use to where they belong: power-efficient or low-performance microcontrollers.
The possibility of I/O is in no way exclusive to compressed instructions. If the page-crossing instruction were padded out, the second page would need to be faulted in anyway. All that matters is the number of pages needed for the piece of code, which is simply code size.
The only case that actually has a chance of mattering is crossing cache lines.
And I would imagine high-performance cores would have some internal instruction buffer anyway, for doing cross-fetch-block instruction fusion and whatnot.