We figured out yesterday [1] that the example in the article can already be done in four RISC-V instructions; it's just a bit trickier to come up with:
# a0 = rax, a1 = rbx
slli t0, a1, 64-8
rori a0, a0, 16
add a0, a0, t0
rori a0, a0, 64-16
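The four instructions can be checked with a small Python model of 64-bit registers. This is only a sketch: it assumes the article's example is x86 `add ah, bl` (which is what the sequence appears to compute), it ignores flags, and the helper names `rori`, `add_ah_bl` and `reference` are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def rori(x, n):
    """Rotate a 64-bit value right by n bits, like the RISC-V rori instruction."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64

def add_ah_bl(rax, rbx):
    """Model of the four-instruction sequence (a0 = rax, a1 = rbx)."""
    t0 = (rbx << (64 - 8)) & MASK64   # slli t0, a1, 64-8   : BL moves to bits 56..63
    a0 = rori(rax, 16)                # rori a0, a0, 16     : AH moves to bits 56..63
    a0 = (a0 + t0) & MASK64           # add  a0, a0, t0     : carry out of AH falls off bit 63
    return rori(a0, 64 - 16)          # rori a0, a0, 64-16  : rotate everything back

def reference(rax, rbx):
    """Straightforward definition of the assumed semantics: AH = (AH + BL) mod 256."""
    ah = (rax >> 8) & 0xFF
    bl = rbx & 0xFF
    return (rax & ~0xFF00 & MASK64) | (((ah + bl) & 0xFF) << 8)
```

The trick is that AH sits in the top byte during the add, so the carry is discarded by the 64-bit wraparound and no masking is needed.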
[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...
Flag computation and conditional jumps are where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for flags and then computes flags one by one. QEMU instead tries to store the original operands and computes flags lazily. Both approaches have advantages and disadvantages...
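To make the lazy approach concrete, here is a minimal Python sketch: an 8-bit add records its operands and result, and each flag is derived only when something actually reads it. The class and method names are hypothetical, and this is not QEMU's or Box64's actual code; only the idea of deferring flag computation comes from the comment above.

```python
class LazyAdd8:
    """Record an 8-bit ADD and derive x86-style flags only on demand."""

    def __init__(self, a, b):
        self.a = a & 0xFF
        self.b = b & 0xFF
        self.result = (self.a + self.b) & 0xFF   # the only work done eagerly

    def cf(self):
        # Carry: the unsigned sum did not fit in 8 bits.
        return self.a + self.b > 0xFF

    def zf(self):
        # Zero: the truncated result is 0.
        return self.result == 0

    def sf(self):
        # Sign: top bit of the result.
        return bool(self.result & 0x80)

    def of(self):
        # Signed overflow: operands share a sign that differs from the result's.
        return bool(~(self.a ^ self.b) & (self.a ^ self.result) & 0x80)
```

If the next instruction overwrites the flags before anyone reads them, none of the flag methods ever run. That is the saving the lazy scheme is after, at the cost of keeping the operands around.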
That's just one more instruction: right-align the AH, BH, etc. source operand, then run exactly the same instructions as above.
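A sketch of that five-instruction variant in Python, assuming the case in question is `add ah, bh` and assuming the extra instruction is a `srli` by 8 (my reading of "right-align"; neither is spelled out in the comment). Flags are ignored and the helper names are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def rori(x, n):
    """Rotate a 64-bit value right by n bits, like the RISC-V rori instruction."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64

def add_ah_bh(rax, rbx):
    """Model of the assumed five-instruction sequence (a0 = rax, a1 = rbx)."""
    t0 = rbx >> 8                     # srli t0, a1, 8      : BH moves to bits 0..7 (assumed extra step)
    t0 = (t0 << (64 - 8)) & MASK64    # slli t0, t0, 64-8   : BH moves to bits 56..63, the rest is shifted out
    a0 = rori(rax, 16)                # rori a0, a0, 16     : AH moves to bits 56..63
    a0 = (a0 + t0) & MASK64           # add  a0, a0, t0     : carry out of AH falls off bit 63
    return rori(a0, 64 - 16)          # rori a0, a0, 64-16  : rotate everything back
```

The left shift by 56 throws away everything above the byte that was right-aligned, so the upper bits of `rbx` need no separate masking.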
And yes, this being 64-bit code, compilers won't be generating such instructions. In fact, they started avoiding them as soon as OoO arrived with the Pentium Pro, P II, P III, etc. in the mid-90s, because of "partial register update stalls".