I guess the notion is that data cache misses will basically lead to what could be called "instruction amplification," where an instruction will miss its scheduled time slot and have to be replayed, possibly repeatedly, until its dependencies are available. The article asserts that this is the rough equivalent of leaving execution ports unoccupied in a "traditional" OoO architecture, but I'm not so sure. I'm curious about how well this works in practice; I would worry that cache misses would rapidly multiply into a cascading failure case where the entire pipeline basically stalls and the architecture reverts to in-order level performance - just like most general-purpose VLIW architectures.
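To put rough numbers on that "instruction amplification" worry, here is a toy model. It is only a sketch under loud assumptions: the fixed replay interval, the cycle counts, and the function names are all made up for illustration and are not taken from the article or from any real core.

```python
import math

def issue_attempts(actual_latency: int, predicted_latency: int,
                   replay_interval: int) -> int:
    """Times a dependent instruction is issued before its operand is ready.

    Assumption: the instruction is first issued at the predicted latency and,
    if the data is still missing, is replayed every `replay_interval` cycles.
    """
    if actual_latency <= predicted_latency:
        return 1  # prediction held, one issue slot used
    late = actual_latency - predicted_latency
    return 1 + math.ceil(late / replay_interval)

def average_attempts(miss_rate: float, hit_latency: int,
                     miss_latency: int, replay_interval: int) -> float:
    """Expected issue slots per dependent instruction for a given miss rate."""
    on_miss = issue_attempts(miss_latency, hit_latency, replay_interval)
    return (1.0 - miss_rate) * 1 + miss_rate * on_miss

# Hypothetical numbers: 4-cycle L1 hit, 100-cycle miss, replay every 4 cycles.
print(issue_attempts(100, 4, 4))
print(average_attempts(0.05, 4, 100, 4))
```

In this model a single miss costs the dependent instruction 25 issue slots instead of 1, so even a 5% miss rate more than doubles the average. Whether a real design replays this naively is exactly the open question.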
Variable &=~(2^Bit)
The series of bitwise operators looks more like grawlix (https://en.wikipedia.org/wiki/Grawlix) than like instructions, as though yelling pejoratives at the bit is what clears it.
Some real-world examples in simdjson: https://arxiv.org/pdf/1902.08318
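For anyone squinting at the bit-clearing expression quoted above: the `2^Bit` there is meant as "two to the power of Bit", but in C-family languages and in Python `^` is XOR, so the mask is normally written as a shift. A minimal sketch in Python, with `clear_bit` being a name made up for this example:

```python
def clear_bit(value: int, bit: int) -> int:
    """Return `value` with bit number `bit` (0 = least significant) cleared."""
    mask = 1 << bit          # the intended "2 to the power of bit"
    return value & ~mask     # AND with the inverted mask clears just that bit

# `^` is XOR, not exponentiation, so 2 ^ 3 is not 8.
print(2 ^ 3, 1 << 3)
print(bin(clear_bit(0b1111, 2)))
```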
I'd think there are quite a few data structures and algorithms where there can be benefits of using powers of two, or to count bits in a word.
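A few of the usual suspects, sketched in Python. The function names are hypothetical, and `ring_index` assumes the capacity is a power of two, which is the whole point of the trick.

```python
def is_power_of_two(x: int) -> bool:
    """A power of two has exactly one set bit, so x & (x - 1) clears it to 0."""
    return x > 0 and (x & (x - 1)) == 0

def popcount(x: int) -> int:
    """Count set bits in a non-negative integer (what a CPU `cpop`/`popcnt` does)."""
    return bin(x).count("1")

def ring_index(i: int, capacity: int) -> int:
    """Wrap an index into a ring buffer whose capacity is a power of two.

    A mask replaces the more expensive modulo: i % capacity == i & (capacity - 1).
    """
    assert is_power_of_two(capacity), "capacity must be a power of two"
    return i & (capacity - 1)

print(is_power_of_two(64), is_power_of_two(48))
print(popcount(0b101101))
print(ring_index(10, 8))
```

Population counts show up in bitmap indexes, Hamming distance, and sparse-array (bitmap plus packed array) lookups, which is roughly the territory the simdjson paper covers.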
RISC-V without the B(itmanip) extension is otherwise quite spartan. B also contains many instructions that other ISAs have in their base set, such as address calculation, and/or/xor not, rol/ror, and even some zero/sign-extension ops.
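To make that concrete, here is a sketch of what three of those operations compute, modeled on 32-bit values in Python. The function names follow the mnemonics as I understand them (`andn` and `rol` from Zbb, `sh2add` from Zba); treat the exact operand order as my assumption and check the spec before relying on it.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register

def andn(rs1: int, rs2: int) -> int:
    """AND with the second operand inverted: rs1 & ~rs2."""
    return rs1 & ~rs2 & MASK32

def rol(rs1: int, shamt: int) -> int:
    """Rotate left by shamt bits (shift amount taken modulo 32)."""
    shamt %= 32
    return ((rs1 << shamt) | (rs1 >> (32 - shamt))) & MASK32

def sh2add(rs1: int, rs2: int) -> int:
    """Address calculation: rs2 + (rs1 << 2), e.g. base + index * 4."""
    return (rs2 + (rs1 << 2)) & MASK32

print(bin(andn(0b1111, 0b0101)))
print(hex(rol(0x80000001, 1)))
print(sh2add(3, 100))
```

Without B, each of these takes two or three base instructions, which is why the base ISA feels spartan next to ISAs that fold them into the core set.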
This latest core looks very interesting, can't wait to see it hit silicon and see what it can really do!
So you know what code can be statically scheduled just from the instructions already.
It seems very unlikely to me that this design would have "raw" performance comparable to a design that implements something closer to Tomasulo's algorithm. The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction; I can imagine scenarios where this acts as a "double jeopardy," causing scheduling to lock up because the latency was mispredicted, but one could also speculate that this isn't important because the workload is already memory bound.
There's an intuition in computer architecture that designs that lean on "static" instruction scheduling mechanisms are less performant than more dynamic mechanisms for general-purpose compute, but we've had decades of compiler development since Itanium "proved" this. Efficient Computer (or whatever their name is) is doing something cool too; it's exciting to see where this will go.
1) load some model and set the system into “ready” mode
2) wait for an event from a sensor
3) when the event occurs, trigger some response
4) do other stuff: bookkeeping, updating the model, etc.
5) reset the system to “ready” mode and goto 2
Is it possible we might want some hard time bounds on steps 2 and 3, but be fine with 1, 4, and 5 taking however long? (Assuming the device can be inactive while it is getting ready). Then, we could make sure steps 2 and 3 don’t include any non-static instructions.
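The loop above, with a bound checked only on the event-to-response path, could be sketched like this. Everything here is hypothetical: the callback names, the injected `clock`, and the deadline are stand-ins, and Python itself obviously gives no hard timing guarantee. The point is only where the measured window begins and ends.

```python
from typing import Any, Callable, Tuple

def run_cycle(wait_for_event: Callable[[], Any],
              respond: Callable[[Any], None],
              housekeeping: Callable[[Any], None],
              clock: Callable[[], int],
              deadline_ns: int) -> Tuple[int, bool]:
    """Run steps 2-5 once; return (response latency, whether the bound held)."""
    event = wait_for_event()        # step 2: block until the sensor fires
    start = clock()
    respond(event)                  # step 3: the only time-critical work
    latency = clock() - start
    housekeeping(event)             # steps 4-5: unbounded, not measured
    return latency, latency <= deadline_ns

# Usage with a fake clock so the example is deterministic.
ticks = iter([100, 350])
log = []
result = run_cycle(
    wait_for_event=lambda: "sensor-event",
    respond=log.append,
    housekeeping=lambda e: log.append("reset"),
    clock=lambda: next(ticks),
    deadline_ns=300,
)
print(result, log)
```

On hardware like the one being discussed, the equivalent of `respond` is the part you would want built entirely from instructions with statically known latency.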
I think an equally large change is the enormous rise of open source and supply chain focus. When Itanium came out, there was tons of code businesses ran which had been compiled years ago, lots of internal reimplementation of what would now be library code, and places commonly didn’t upgrade for years because that was also often a licensing purchase. Between open source and security, it’s a lot more reasonable now to think people will be running optimized binaries from day one and in many cases the common need to support both x86 and ARM will have flushed out a lot of compatibility warts along with encouraging use of libraries rather than writing as many things on their own.
CPU manufacturers also aren't using Unicode, using the letter u instead of µ (micro), and the letter A instead of Å (the unit Ångström).
Sure, but as long as a design depends on "the assumption that the latency of a load will be an L1 hit," it's in trouble for most of what we think of as "general purpose" computing.
I think you get it, but there's this overall trope that the issue with Itanium was purely compiler-related: that we didn't have the algorithms or compute resources to parallelize enough of a single program's control flow to correctly fill the dispatch slots in a bundle. I really disagree with this notion: this might have been _a_ problem, but it wasn't _the_ problem.
Even an amazing compiler which can successfully resolve all data dependencies inside of a single program and produce a binary containing ideal instruction bundling has no idea what's in dcache in the case of an interrupt/context switch, and therefore every load and all of its dependencies risks a stall (or in this case, replay) for a statically scheduled architecture, while a modern out-of-order architecture can happily keep going, even speculatively taking both sides of branches.
The modern approach to optimize datacenter computing is to aggressively pack in context switches, with many execution contexts (processes, user groups/containers, whatever) per guest domain and many guest domains per hypervisor.
Basically: I have yet to see someone successfully use the floor plan they took back from not doing out-of-order to effectively fill in for memory latency in a "general purpose" datacenter computing scenario. Most designers just add more cores, which only makes the problem worse (even adding more cache would be better than more cores!).
VLIW and this kind of design have a place: I could see a design like this being useful in place of Cortex-A or even Cortex-X in a lot of edge compute use cases, and of course GPUs and DSPs already rely almost exclusively on some variety of "static" scheduling already. But as a stated competitor to something like Neoverse/Graviton/Veyron in the datacenter space, the "load-bearing load" (I like your description!) seems like it's going to be a huge problem.
I think stalling is used for rarer more awkward things like changing privilege modes or writing certain CSRs (e.g. satp) where you don't want to have to maintain speculative state.
It's a smart idea, and has some parallels to the Mill CPU design. The backend is conceptually similar to a statically scheduled VLIW core, and the front end races ahead using its matrix scorecard, trying to queue up as much work as it can for the backend despite unpredictable latencies.
There were some fascinating concepts being explored in that project. It's a shame nothing came of it.