In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISC-V ISA is (#2 is the lack of sane addressing modes).
Some of the better RISC-V designs do, in fact, implement a custom instruction for this, e.g. BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....
> Because box86 uses the native versions of some “system” libraries, like libc, libm, SDL, and OpenGL, it’s easy to integrate and use with most applications, and performance can be surprisingly high in some cases.
Wine can also be compiled/run as native.
We figured out yesterday [1] that the example in the article can already be done in four RISC-V instructions; it's just a bit trickier to come up with:
# a0 = rax, a1 = rbx
slli t0, a1, 64-8    # t0 = low byte of rbx, moved into the top byte
rori a0, a0, 16      # rotate rax so that AH lands in the top byte
add  a0, a0, t0      # add into the top byte; carries past bit 63 are discarded
rori a0, a0, 64-16   # rotate back (left by 16) to restore the original layout
[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...

What's your take on
1) unaligned 32-bit instructions with the C extension?
2) lack of 'trap on overflow' for arithmetic instructions? MIPS had it..
2. Nobody uses it on MIPS either, so it is likely of no use.
Flag computation and conditional jumps are where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for flags and then computes flags one by one. QEMU instead tries to store the original operands and computes flags lazily. Both approaches have advantages and disadvantages...
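To make the lazy approach concrete, here's a rough C sketch (the names and the two-op subset are mine, not box64's or QEMU's actual code): the emulated arithmetic just records its operands, and individual flags get derived only when a conditional branch actually asks for them.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t src1, src2, result;
    enum { OP_ADD, OP_SUB } op;
} flag_state;

/* The emulated ADD records its inputs; no flag bits are materialized yet. */
static uint64_t emu_add(flag_state *f, uint64_t a, uint64_t b) {
    f->src1 = a; f->src2 = b; f->op = OP_ADD;
    return f->result = a + b;
}

/* Flags are derived on demand, e.g. only when a Jcc needs them. */
static bool flag_zf(const flag_state *f) { return f->result == 0; }

static bool flag_cf(const flag_state *f) {
    switch (f->op) {
    case OP_ADD: return f->result < f->src1;  /* carry out of the add */
    case OP_SUB: return f->src1 < f->src2;    /* borrow */
    }
    return false;
}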
RISC was explained to me as a reduced instruction set computer in computer science history classes, but I see a lot of articles and proposed new RISC-V profiles about "we just need a few more instructions to get feature parity".
I understand that RISC-V is just a convenient alternative to other platforms for most people, but does this also mean the RISC dream is dead?
Although as I recall from reading the RISC-V spec, RISC-V was rather particular about not adding "combo" instructions when common instruction sequences can be fused by the frontend.
My (far from expert) impression of RISC-V's shortcomings versus x86/ARM is more that the specs were written starting with the very basic embedded-chip stuff, and then over time more application-cpu extensions were added. (The base RV32I spec doesn't even include integer multiplication.) Unfortunately they took a long time to get around to finishing the bikeshedding on bit-twiddling and simd/vector extensions, which resulted in the current functionality gaps we're talking about.
So I don't think those gaps are due to RISC fundamentalism; there's no such thing.
Assembly programming is a real pain in the RISCiest of RISC architectures, like SPARC. Here's an example from https://www.cs.clemson.edu/course/cpsc827/material/Code%20Ge...:
• All branches (including the one caused by CALL, below) take place after execution of the following instruction.
• The position immediately after a branch is the “delay slot” and the instruction found there is the “delay instruction”.
• If possible, place a useful instruction in the delay slot (one which can safely be done whether or not a conditional branch is taken).
• If not, place a NOP in the delay slot.
• Never place any other branch instruction in a delay slot.
• Do not use SET in a delay slot (only half of it is really there).
The surviving CISCs, x86 and IBM's z/Architecture, are the least CISCy CISCs. The surviving RISCs, ARM and POWER, are the least RISCy RISCs.
RISC V is a weird throwback in some aspects of its instruction set design.
I'm not sure you can run Wine natively to run x86 Windows programs on RISC-V because Wine is not an emulator. There is an ARM port of Wine, but that can only run Windows ARM programs, not x86.
Instead box64 is running the x86_64 Wine https://github.com/ptitSeb/box64/blob/main/docs/X64WINE.md
(Although I suspect providing the API of one architecture while building for another is far easier said than done. Toolchains tend to be uncooperative about such shenanigans, for starters.)
How much faster, though? RISC-V decode is not crazy like x86: you only need to look at the first byte to know how long the instruction is (the low two bits if you limit yourself to 16- and 32-bit instructions, six bits if you support 48-bit instructions, seven bits if you support 64-bit instructions). Which means the serial part of the decoder is very, very small.
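For what it's worth, here's a small C sketch of that length determination from the first 16-bit parcel, following the length encoding in the unprivileged spec (only the 16/32/48/64-bit cases; the function name is made up):

#include <stdint.h>

static int rv_insn_length(uint16_t first_parcel) {
    if ((first_parcel & 0x03) != 0x03) return 2;  /* bits [1:0] != 11  -> 16-bit */
    if ((first_parcel & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111 -> 32-bit */
    if ((first_parcel & 0x20) == 0)    return 6;  /* bit  5 == 0       -> 48-bit */
    if ((first_parcel & 0x40) == 0)    return 8;  /* bit  6 == 0       -> 64-bit */
    return 0;                                     /* longer/reserved encodings */
}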
The bigger complaint about variable-length instructions is potentially misaligned instructions, which do not play well with cache lines (a single instruction may start in one cache line and end in the next, making the hardware a bit more hairy).
And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.
Thus, it's not clear to me that fixed-size instructions are the obvious way to go for big cores.
In theory, if you compiled the original _source_ code for RISC-V, you'd get an entirely native binary and wouldn't need those specific instructions.
In practice, I doubt anyone is going to actually compile these games for RISC-V.
Wine's own documentation says it requires an emulator: https://wiki.winehq.org/Emulation
> As Wine Is Not an Emulator, all those applications can't run on other architectures with Wine alone.
Or do you mean providing the x86_64 Windows API as a native RISC-V/ARM library to the emulator layer? That would require some deeper integration for the emulator, but that's what box64/box86 already does with some Linux libraries: intercept the API calls and replace them with native libraries. Not sure if it does that for Wine.
This didn't work out, for two main reasons. First, just being able to turn clocks hella high is still not enough to get great performance: you really do want your CPU to be superscalar, out-of-order, and with a great branch predictor if you need amazing performance. But once you do all that, the simplicity of RISC decoding stops mattering all that much, as the Pentium II demonstrated when it equalled the DEC Alpha on performance while still having practically useful things like byte loads/stores. Yes, it's RISC-like instructions under the hood, but that's an implementation detail; there is no reason to expose it to the user in the ISA, just as you don't have to expose branch delay slots in your ISA, because doing so is a bad idea: e.g. MIPS II added one additional pipeline stage, and suddenly they needed two branch/load delay slots. Whoops! So they added interlocks anyway (MIPS originally stood for "Microprocessor without Interlocked Pipeline Stages", ha-ha) and got rid of the load delays; they still left one branch delay slot exposed for backwards compatibility, and the circuitry required was arguably silly.
The second reason was that the software (or compilers, to be more precise) can't really deal very well with all that stuff from the first paragraph. That's what sank Itanium. That's why nobody makes CPUs with register windows any more. And static instruction scheduling in the compilers still can't beat dynamic instruction reordering.
In the meantime, it can be done as two shifts: left to the MSB, and then right filling with zero or sign bits. There is at least one core in development (SpaceMiT X100) that is supposed to be able to fuse those two into a single µop, maybe some that already do.
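In C terms the two-shift extract looks something like this (a sketch, assuming 1 <= width and lsb + width <= 64; on RV64 it should compile to an slli/srli pair, or slli/srai for the sign-extending variant):

#include <stdint.h>

static uint64_t bitfield_extract(uint64_t x, unsigned lsb, unsigned width) {
    return (x << (64 - lsb - width))   /* shift the field up to the MSB */
           >> (64 - width);            /* shift back down, filling with zeros */
}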
However, I've also seen that one core (XiangShan Nanhu) is fusing pairs of RVI instructions into the equivalent of a single B-extension instruction, to be able to run old binaries compiled for CPUs without B faster. Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.
c.f. https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...
Yeah, that's what I meant. It's simple in principle, after all: turn an AMD64 call into an ARM/RISC-V call and pass it to native code.
Doing that for Wine would be pretty tricky (way more surface area to cover, possible differences between certain Win32 arch-specific structs and so forth) so I bet that's not how it works out of the box, but I couldn't tell for sure by skimming through the box64 repo.
I think it's quite unclear which one is better. 48-bit instructions have a lot of potential, imo: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit ones (2/3 to 3/4 of 43 bits of encoding).
There are essentially two design philosophies:
1. 32-bit instructions, plus naturally aligned 64-bit instructions
2. 16/32/48/64-bit instructions with 16-bit alignment
Implementation complexity is debatable, although it seems to somewhat favor option 1:
1: you need to crack instructions into uops, because your 32-bit instructions need to do more complex things
2: you need to find instruction starts, and handle decoding instructions that span across a cache line
How big the impact is relative to the entire design is quite unclear.
Finding instruction starts means you need to propagate a few bits over your entire decode width, but cracking also requires something similar. Consider that if you can handle 8 uops, then those can come from the first 4 instructions that are cracked into 2 uops each, or from 8 instructions that don't need to be cracked, and everything in between. With cracking, you have more freedom about when you do it in the pipeline, but you still have to be able to handle it.
In the end, both need to decode across cachelines for performance, but one needs to deal with an instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.
If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.
That's not the whole story: a simpler pipeline takes fewer engineering resources, so teams going for a high-performance design can spend more time optimizing.
RISC is generally a philosophy of simplification but you can take it further or less far. MIPS is almost as simplified as RISC-V but ARM and POWER are more moderate in their simplifications and seem to have no trouble going toe to toe with x86 in high performance arenas.
But remember there are many niches for processors out there besides running applications. Embedded, accelerators, etc. In the specific niche of application cores I'm a bit pessimistic about RISC-V but from a broader view I think it has a lot of potential and will probably come to dominate at least a few commercial niches as well as being a wonderful teaching and research tool.
This is just insane and gets us full-circle to why we want RISC-V.
One suggested solution has been filling in the gaps with NOPs, but then the compiler would have to track page alignment, which would not work anyway if a system supports pages of varying sizes (ordinary vs. huge pages).
The best solution is perhaps to ignore compressed instructions when targeting high performance cores and confine their usage to where they belong: power efficient or low performance microcontrollers.
Modern CPUs are actually really good at decoding operations into micro-ops. And the flexibility of being able to implement a complex operation in microcode or in silicon is essential for CPU designers.
Is there a bunch of legacy crap in x86? Yeah. Does getting rid of it dramatically increase the performance ceiling? Probably not.
The real benefit of RISC-V is anybody can use it. It's democratizing the ISA. No one has to pay a license to use it, they can just build their CPU design and go.
The largest out-of-order CPUs are actually quite reliant on having high-performance decode that can be performed in parallel using multiple hardware units. Starting from a simplified instruction set with less legacy baggage can be an advantage in this context. RISC-V is also pretty unique among 64-bit RISC ISA's wrt. including compressed instructions support, which gives it code density comparable to x86 at a vastly improved simplicity of decode (For example, it only needs to read a few bits to determine which insns are 16-bit vs. 32-bit length).
Hey! ;-)
Except this isn't actually true.
> Does getting rid of it dramatically increase the performance ceiling?
No, but it dramatically DECREASES the amount of investment necessary to reach that ceiling.
Assume you have two teams and each gets the same amount of money. Then ask them to make the highest-performing spec-compatible chip. Which team is going to win 99% of the time?
> And the flexibility of being able to implement a complex operation in microcode, or silicon is essential for CPU designers.
You can add microcode to a RISC-V chip if you want, most people just don't want to.
> The real benefit of RISC-V is anybody can use it.
That is true, but it's also just a much better instruction set than x86 -_-
It's clearly necessary to have compatibility back to the 80s. It's clearly necessary to have 10 different generations of SIMD. It's clearly necessary to have multiple different floating-point systems.
It's simply about the amount of investment. x86 had 50 years of gigantic amounts of sustained investment. Intel outsold all the RISC vendors combined by like 100 to 1 because they owned the PC business.
When Apple started seriously investing in ARM, they were able to match or beat x86 laptops.
The same will be true for RISC-V.
I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?
I would imagine that some types of software are better suited for either CISC or RISC, like games or webservers?
There's not much inherent that needs to change in software approach. Probably the biggest thing vs x86-64 is the availability of 32 registers (vs 16 on x86-64), allowing for more intermediate values before things start spilling to stack, which also applies to ARM (which too has 32 registers). But generally it doesn't matter unless you're micro-optimizing.
More micro-optimization things might include:
- The vector extension (aka V or RVV) isn't in the base rv64gc ISA, so you might not get SIMD optimizations depending on the target; whereas x86-64 and aarch64 have SSE2 and NEON (128-bit SIMD) in their base.
- Similarly, no popcount & count leading/trailing zeroes in base rv64gc (requires Zbb); base x86-64 doesn't have popcount, but does have clz/ctz. aarch64 has all.
- Less efficient branchless select, i.e. "a ? b : c"; takes ~4-5 instrs on base rv64gc, 3 with Zicond, but 1 on x86-64 and aarch64 (see the sketch after this list). Some hardware can also fuse a jump over a mv instruction to be effectively branchless, but that's even more target-specific.
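A sketch of what that branchless select boils down to (hypothetical helper; roughly the xor-blend sequence a compiler can emit on base rv64gc):

#include <stdint.h>

static uint64_t select64(uint64_t cond, uint64_t b, uint64_t c) {
    uint64_t mask = -(uint64_t)(cond != 0);  /* all-ones if cond, else zero */
    return c ^ ((b ^ c) & mask);             /* b if cond, else c */
}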
RISC-V profiles kind of solve the first two issues (e.g. Android requires rva23, which requires rvv & Zbb & Zicond among other things) but if linux distros decide to target rva20/rv64gc then they're ~forever stuck without having those extensions in precompiled code that hasn't bothered with dynamic dispatch. Though this is a problem with x86-64 too (much less so with ARM as it doesn't have that many extensions; SVE is probably the biggest thing by far, and still not supported widely (i.e. Apple silicon doesn't)).
Since the RISC-V ISA is worldwide royalty-free and more than nice, I am writing basic rv64 assembly which I then interpret on x86_64 hardware with a Linux kernel.
I did not push the envelope as far as having a "compiler", because this is really just a stopgap while waiting for hardcore performant desktop-class (aka large) rv64 hardware implementations.
... except it did.
You had literal students design chips that outperformed industry cores that took huge teams and huge investment.
Acorn had a team of just a few people build a core that outperformed an i460 with likely 1/100th of the investment. Not to mention the even more expensive VAX chips.
Can you imagine how fucking baffled the DEC engineers at the time were when their absurdly complex and absurdly expensive VAX chips were smoked by a bunch of first-time chip designers?
> as Pentium II demonstrated
That chip came out in 1997. The original RISC chip research happened in the early 80s or even earlier. It did work; it's just that x86 was bound to the PC market and Intel had the finances to have huge teams hammer away at the problem. x86 was able to overtake Alpha because DEC was not doing well and couldn't invest the required amount.
> no reason to expose it to the user in the ISA
Except that hiding the implementation is costly.
If you give two equal teams the same amount of money, which results in a faster chip: a team that does a simple RISC instruction set, or a team that does a complex CISC instruction set and transforms it into an underlying simpler instruction set?
Now of course Intel had backward compatibility to deal with, so they had to do what they had to do. They were just lucky they were able to invest so much more than all the other competitors.
Most of the time, nothing; code correctly written in higher-level languages like C should work the same. The biggest difference, the weaker memory model, is something you also have on most non-x86 architectures like ARM (and your code shouldn't be depending on having a strong memory model in the first place).
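As a concrete, hypothetical example of the memory-model point: a plain flag-publish pattern that often appears to work under x86's TSO can break on ARM or RISC-V, while C11 release/acquire atomics make it correct everywhere:

#include <stdatomic.h>

int payload;
atomic_int ready;

void producer(void) {
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
}

int consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;               /* spin until the flag is observed */
    return payload;     /* acquire/release guarantees we see 42 */
}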
> I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?
For historical reasons, executable code density on x86 is not that good, so the executable size won't increase as much as you'd expect; both RISC-V with its compressed instructions extension and 32-bit ARM with its Thumb extensions are fairly compact (there was an early RISC-V paper which did that code size comparison, if you want to find out more).
> I would imagine that some types of software are better suited for either CISC or RISC, like games or webservers?
What matters most is not CISC vs RISC, but the presence and quality of things like vector instructions and cryptography extensions. Some kinds of software like video encoding and decoding heavily depend on vector instructions to have good performance, and things like full disk encryption or hashing can be helped by specialized instructions to accelerate specific algorithms like AES and SHA256.
But for an emulator like this, box64 has to pick how to emulate vectorized instructions on RISC-V (e.g. slowly using scalars, or trying to reimplement them using native vector instructions). The challenge, of course, is that you typically don't get as good performance unless the emulator can actually rewrite the code on the fly: a 1:1 mapping is going to be suboptimal compared to noticing patterns of high-level operations and replacing a whole chunk of instructions with a more optimized implementation that accounts for implementation differences on the chip (e.g. you may have to emulate missing instructions, but a rewriter could skip emulation entirely if there's an alternate way to accomplish the same high-level computation).
The biggest challenge for something like this from a performance perspective, of course, will be translating the GPU stuff efficiently to hit the native driver code, and that RISC-V is likely relying on OSS GPU drivers (and maybe Wine adding another translation layer if the game is Windows-only).
Depends on the amount of money. If it's less than a certain amount, the RISC design will be faster. If it's above, both designs will perform about the same.
I mean, look at ARM: they too decode their instructions into micro-ops and cache those in their high-performance models. What RISC buys you is the ability to be competitive at the low end of the market, with simplistic implementations. That's why we won't ever see e.g. a stack-like machine: no exposed general-purpose registers, but flexible addressing modes for the stack (even something like [SP+[SP+12]]), with the stack mirrored onto a hidden register file used as an "L0" cache, which neatly solves the problem that register windows were supposed to solve. Such a design can be made as fast as server-grade x86 or ARM, but only by throwing billions of dollars and several man-millennia at it; if you try to do it cheaper and quicker, its performance would absolutely suck. That's why e.g. System/360 didn't make that design choice although IBM seriously considered it for half a year: they found out that the low-end machines would be unacceptably slow, so they went with a "registers with base-plus-offset addressed memory" design.
There's even "#pragma clang loop vectorize(assume_safety)" to tell it that pointer aliasing won't be an issue (gcc has a similar "#pragma GCC ivdep"), which should get rid of most odd reasons for missed vectorization.
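A minimal usage sketch (clang; the loop is just an illustration):

void scale(float *dst, const float *src, float k, int n) {
#pragma clang loop vectorize(assume_safety)
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;   /* we promise dst and src don't alias */
}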
There’s a reason that Apple is whooping AMD and Intel on performance/watt and it’s not solely because they’re on a newer fab process (it’s also why AMD and Intel utterly failed to get mobile CPU variants of their chips off the ground).
Right, but most of the time those are architecture-specific, and RVV 1.0 is substantially different from, say, NEON or SSE2, so you need to change it anyway. You also typically use specialized registers for those, not the general-purpose registers. I'm not saying there isn't work to be done (especially for an application like this one, which is extremely performance-sensitive); I'm saying that most applications won't have these problems or be so sensitive that register spills matter much, if at all.
> a line of cache and 4 byte instructions you could start decoding 32 instructions in parallel
In practice, ARM processors decode up to 4 instructions in parallel; so do Intel and AMD.
And:
Milk-V Pioneer: a 64-core RISC-V motherboard and workstation for native development
On a slight tangent, from a security perspective, if your silicon is "too clever" in a way that introduces security bugs, you're screwed. On the other hand, software can be patched.
If it's in the linker then tracking pages sounds pretty doable.
You don't need to care about multiple page sizes. If you pad at the minimum page size, or even at 1 KB boundaries, that's a minuscule number of NOPs.
This is more about being free of ARM’s patents and getting a fresh start using the lessons learned
Removed complexity means you can fit more stuff into the same amount of silicon, and have it be quicker with less power.
To me, the whole RISC-V interest is all just marketing. As an end user I don't make my own chips and I can't think of any particular reason I should care whether a machine has RISC-V, ARM, x86, SPARC, or POWER. In the end my cost will be based on market scale and performance. The licensing cost of the design will not be passed on to me as a customer.
Characteristics of classical RISC:
- Most data manipulation instructions work only with registers.
- Memory instructions are generally load/store to registers only.
- That means you need lots of registers.
- Do your own stack because you have to manually manipulate it to pass parameters anyway. So no CALL/JSR instruction. Implement the stack yourself using some basic instructions that load/store to the instruction pointer register directly.
- Instruction encoding is predictable and each instruction is the same size.
- More than one RISC arch has a register that always reads 0 and can't be written. Used for setting things to 0.
This worked, but then the following made it less important:
- Out-of-order execution - generally the raw instruction stream is a declaration of a path to desired results, but isn't necessarily what the CPU is really doing. Things like speculative execution, branch prediction and register renaming are behind this.
- SIMD - basically a separate wide register space with instructions that work on all values within those wide registers.
So really OOO and SIMD took over.
It often feels like as a community we don't have an interest in making better tools than those we started with.
Communicating with the compiler, and generating code with code, and getting information back from the compiler should all be standard things. In general they shouldn't be used, but if we also had better general access to profiling across our services, we could then have specialists within our teams break out the special tools and improve critical sections.
I understand that many of us work on projects with already absurd build times, but I feel that is a side effect of refusal to improve ci/cd/build tools in a similar way.
If you have ever worked on a modern TypeScript framework app, you'll understand what I mean. You can create decorators and macros talking to the TypeScript compiler and asking it to generate some extra JS or modify what it generates. And the whole framework sits there running partial re-builds and refreshing your browser for you.
It makes things like golang feel like they were made in the 80s.
Freaking golang... I get it, macros and decorators and generics are over-used. But I am making a library to standardize something across all 2,100 developers within my company... I need some meta-programming tools please.
Sure, addition and moving bits between registers takes one clock cycle, but those kinds of instructions take one clock cycle on CISC as well. And very tiny RISC microcontrollers can take more than one cycle for adds and shifts if you're really stingy with the silicon.
(Memory operations will of course take multiple cycles too, but that's not the CPU's fault.)
Which seems like stuff you want support for, but this is seemingly arguing against?
The possibility of I/O is in no way exclusive to compressed instructions. If the page-crossing instruction were padded out, the second page would need to be faulted in anyway. All that matters is the number of pages of code needed for the piece of code, which is simply just code size.
The only case that actually has a chance of mattering is simply crossing cache lines.
And I would imagine high-performance cores would have some internal instruction buffer anyway, for doing cross-fetch-block instruction fusion and whatnot.
In practice there's no such thing as "RISC" or "CISC" anymore really, they've all pretty much converged. At best you can say "RISC" now just means that there aren't any mixed load + alu instructions, but those aren't really used in x86 much, either
Another would be that something like fused multiply-add has different (higher, if I recall correctly) precision, which violates IEEE 754 and thus blocks vectorization, since the default options are standard-compliant.
Another is that some math intrinsics are documented to populate errno which would prevent using autovec in paths that have an intrinsic.
There may be other nuances depending on float vs double.
Basically, I believe most of the things that make up -ffast-math address issues that would otherwise prevent autovectorization.
False premise: as the 'size' tool shows, RVA20 (RV64GC) binaries were already the smallest among 64-bit architectures.
Code gets smaller still (rather than larger) with newer extensions such as B in RVA22.
As of recently, the same is true in 32-bit when comparing rv32 against the former best (Thumb-2). But it was quite close before to begin with.
For <math.h> errno, there's -fno-math-errno; indeed included in -ffast-math, but you don't need the entirety of that mess for this.
Loops with a float accumulator are, I believe, the only case where -ffast-math is actually required for autovectorizability (and even then, IIRC there are some sub-flags such that you can get the associativity-assuming optimizations while still allowing NaN/inf).
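For example (a sketch): a plain float reduction like the one below won't auto-vectorize at -O2/-O3 alone, because vectorizing it reassociates the additions and changes rounding; -ffast-math (or the narrower -fassociative-math family of flags) permits it.

float sum(const float *a, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i];          /* reassociation is needed to split across SIMD lanes */
    return acc;
}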
That's just one more instruction, to right-align the AH, BH etc src operand prior to exactly the same instructions as above.
And, yes, this being 64 bit code compilers won't be generating such instructions. In fact they started avoiding them as soon as OoO hit in the Pentium Pro, P II, P III etc in the mid 90s because of "partial register update stalls".
See how big of a block you need to get 90% of the compression benefit, etc.
We are three decades past those days, with more than enough hardware to run those systems that used to be available only in universities or at companies with very deep pockets.
Yet the Go culture hasn't updated itself, or only very reluctantly, repeating the usual mistakes that were already apparent when 1.0 came out.
And since they hit gold with the CNCF projects, it's pretty much unavoidable for some work.
https://userpages.umbc.edu/~vijay/mashey.on.risc.html
Notably x86 is one of the less CISCy CISCs so it looks like there might be a happy medium.
I'm wondering, how's the landscape nowadays. Is this the leading project for x86 compatibility on ARM? With the rising popularity of the architecture for consumer platforms, I'd guess companies like Valve would be interested in investing in these sort of translation layers.
More characteristic are assumptions about side effects (none for integer instructions, cumulative flags for FP) and the number of register-file ports needed.