
154 points by rbanffy | 7 comments
1. 1024bees ◴[] No.45075917[source]
It's nice to see a microarchitecture take a risk, and it would be interesting to get some perspective on how this design fares in performance, power, and area.

It seems very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to Tomasulo's algorithm. The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction; I can imagine scenarios where this acts as a "double jeopardy", with scheduling locking up because the latency was mispredicted, but one could also speculate that this doesn't matter because the workload is already memory-bound.

There's an intuition in computer architecture that designs that lean on "static" instruction scheduling mechanisms are less performant than more dynamic mechanisms for general-purpose compute, but we've had decades of compiler development since Itanium "proved" this. Efficient Computer (or whatever their name is) is doing something cool too; it's exciting to see where this will go.

replies(4): >>45076340 #>>45078044 #>>45079822 #>>45081255 #
2. acdha ◴[] No.45076340[source]
> we've had decades of compiler development since Itanium "proved" this.

I think an equally large change is the enormous rise of open source and supply-chain focus. When Itanium came out, there was tons of code businesses ran that had been compiled years earlier, lots of internal reimplementation of what would now be library code, and places commonly didn't upgrade for years because an upgrade was often also a licensing purchase. Between open source and security, it's a lot more reasonable now to think people will be running optimized binaries from day one, and in many cases the common need to support both x86 and ARM will have flushed out a lot of compatibility warts, along with encouraging use of libraries rather than writing as much in-house.

3. bri3d ◴[] No.45078044[source]
> we've had decades of compiler development since Itanium "proved" this

Sure, but until someone does away with "the assumption that the latency of a load will be an L1 hit," they're in trouble for most of what we think of as "general purpose" computing.

I think you get it, but there's this overall trope that the issue with Itanium was purely compiler-related: that we didn't have the algorithms or compute resources to parallelize enough of a single program's control flow to correctly fill the dispatch slots in a bundle. I really disagree with this notion: this might have been _a_ problem, but it wasn't _the_ problem.

Even an amazing compiler that can successfully resolve all data dependencies inside a single program and produce a binary with ideal instruction bundling has no idea what's in dcache after an interrupt/context switch, so every load and all of its dependents risk a stall (or, in this case, a replay) on a statically scheduled architecture, while a modern out-of-order architecture can happily keep going, even speculatively taking both sides of branches.
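
To put rough numbers on it, here's a deliberately crude toy model (the single-issue backend and the 4-cycle/40-cycle figures are made up, not from any real core) of a statically scheduled stream that assumed an L1 hit versus an out-of-order window that keeps issuing independent work past the miss:

    # Crude toy model, not any real core: a load planned as a 4-cycle L1 hit
    # actually takes 40 cycles after a context switch has cooled the dcache.
    ASSUMED_LOAD_LAT = 4
    ACTUAL_MISS_LAT = 40

    def static_stall_cycles(n_dep, n_indep):
        # Static/in-order view: the dependents were scheduled 4 cycles after
        # the load, so the whole stream behind them waits out the exposed miss.
        exposed_stall = ACTUAL_MISS_LAT - ASSUMED_LOAD_LAT
        return 1 + ASSUMED_LOAD_LAT + exposed_stall + n_dep + n_indep

    def ooo_cycles(n_dep, n_indep):
        # OoO view: independent work issues during the miss (assuming the
        # window can find it), and the dependents start once the data returns.
        return 1 + max(ACTUAL_MISS_LAT, n_indep) + n_dep

    print(static_stall_cycles(n_dep=2, n_indep=20))  # 63: the miss is fully exposed
    print(ooo_cycles(n_dep=2, n_indep=20))           # 43: the miss is mostly hidden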

The modern approach to optimizing datacenter computing is to aggressively pack in context switches, with many execution contexts (processes, user groups/containers, whatever) per guest domain and many guest domains per hypervisor.

Basically: I have yet to see someone successfully use the floor plan they took back from not doing out-of-order to effectively fill in for memory latency in a "general purpose" datacenter computing scenario. Most designers just add more cores, which only makes the problem worse (even adding more cache would be better than more cores!).

VLIW and this kind of design have a place: I could see a design like this being useful in place of Cortex-A or even Cortex-X in a lot of edge compute use cases, and of course GPUs and DSPs already rely almost exclusively on some variety of "static" scheduling. But as a stated competitor to something like Neoverse/Graviton/Veyron in the datacenter space, the "load-bearing load" (I like your description!) seems like it's going to be a huge problem.

4. jasonwatkinspdx ◴[] No.45079822[source]
This is still using a Tomasulo-like algorithm; it's just been shifted from the back end to the front end. And instructions don't lock up on an L1 miss. Instead, the result of that instruction is marked as poisoned, and the front end replays its micro-ops further forward in the execution stream once the L1 miss is resolved. As the article points out, this replay is likely to fill otherwise unused execution slots on general-purpose code, since OoO CPUs rarely sustain their full execution width.
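
Roughly how I picture the bookkeeping, as a hedged sketch (the issue width, cycle numbers and exact replay policy here are my guesses, not the article's):

    # Minimal sketch of "poison and replay": the front end books each micro-op
    # into a future issue cycle assuming the predicted latency; when the
    # producing load misses, the dependents are re-booked further down the
    # stream (their original slots just stay poisoned) instead of stalling.
    # The width and cycle numbers are placeholders, not Condor's parameters.
    from collections import defaultdict

    ISSUE_WIDTH = 4
    slots = defaultdict(int)  # cycle -> slots already booked

    def book(cycle):
        """Claim the first free issue slot at or after `cycle`."""
        while slots[cycle] >= ISSUE_WIDTH:
            cycle += 1
        slots[cycle] += 1
        return cycle

    def replay(dependents, data_ready_cycle):
        """On a miss, push the poisoned dependents out to when the data arrives."""
        return {uop: book(data_ready_cycle) for uop in dependents}

    load_cycle = book(10)                       # load issued around cycle 10
    deps = {"add_r3": book(load_cycle + 4)}     # booked for the optimistic L1 hit
    deps = replay(deps, data_ready_cycle=28)    # L1 missed; data due around 28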

It's a smart idea, and it has some parallels to the Mill CPU design. The back end is conceptually similar to a statically scheduled VLIW core, and the front end races ahead using its matrix scorecard, trying to queue up as much as it can in the face of unpredictable latencies.

replies(1): >>45080091 #
5. quantummagic ◴[] No.45080091[source]
> Mill CPU design

There were some fascinating concepts being explored in that project. It's a shame nothing came of it.

replies(1): >>45082621 #
6. brucehoult ◴[] No.45081255[source]
> It seems very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to Tomasulo's algorithm.

The point appears to be losing maybe a few percent (5%-7%) of performance, in exchange for saving tens of percent of energy consumption.
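
Reading "saving tens of percent" as, say, 30% less energy for the same work (purely illustrative numbers picked from inside those ranges, not measurements), the trade looks like:

    # Illustrative only: 6% slower, 30% less energy for the same workload.
    time_ratio = 1 / 0.94      # ~1.06x longer to finish the work
    energy_ratio = 0.70        # 30% less energy for that work
    print(f"{1 / energy_ratio:.2f}x work per joule")          # ~1.43x
    print(f"{energy_ratio / time_ratio:.2f}x average power")  # ~0.66x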

> The assumption that the latency of a load will be an L1 hit is a load-bearing abstraction

That's just the initial assumption: that load results will appear 4 cycles after the load is started. If it gets to that +4 time and the result has not appeared, then it looks for an empty execution slot starting another 14 cycles along (+18 in total) for a possible L2 cache return.

The original slot's result is marked as "poison", so if any instruction is reached that depends on it, it will also be moved further along the RoB and its original slot marked as poison, and so on.

If the dependent instruction was originally at least 18 cycles along from the load being issued then I think it will just pick up the result of the L2 hit (if that happens), and not have to be moved.

If L2 also misses and the result still has not been returned when you get to the moved instruction, a spare execution slot will again be searched for, starting another 20 cycles along (+38 in total), in preparation for an L3 hit.

The article says that when searching for an empty slot, at most 8 cycles' worth of slots are searched. It doesn't say what happens if there are no empty execution slots within that 8-cycle window; I suspect the instruction just gets moved right to the end, where all slots are empty.

It also doesn't say what happens if the load doesn't hit in L3. As main memory latency is under the control of the SoC vendor and/or the main board or system integrator (for sure not Condor), I suspect that L3 misses are also moved to the very end.
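
My attempt to condense that escalation into a sketch; the +4/+18/+38 offsets and the 8-cycle search window come from the description above, but the slot bookkeeping and the "move it to the end" fallback are guesses about the parts the article leaves open:

    # Sketch of the escalating reschedule. Offsets and the search window follow
    # the description above; everything else is guesswork about open details.
    ISSUE_WIDTH = 4                 # assumed width, not from the article
    SEARCH_WINDOW = 8               # at most 8 cycles of slots are searched
    RETRY_OFFSETS = [4, 18, 38]     # cycles after the load: L1-, L2-, L3-hit timing

    def find_slot(slots, start, end_of_stream):
        """Look for a free slot within SEARCH_WINDOW cycles of `start`, falling
        back to the (empty) end of the stream if none is found."""
        for cycle in range(start, start + SEARCH_WINDOW):
            if slots.get(cycle, 0) < ISSUE_WIDTH:
                slots[cycle] = slots.get(cycle, 0) + 1
                return cycle
        slots[end_of_stream] = slots.get(end_of_stream, 0) + 1
        return end_of_stream        # guess: just park it at the end

    def schedule_dependent(slots, load_issue, hit_level, end_of_stream):
        """Place a load-dependent op for the cache level that actually hits;
        the earlier placements are simply left behind as poisoned slots."""
        placements = []
        for level, offset in enumerate(RETRY_OFFSETS):
            placements.append(find_slot(slots, load_issue + offset, end_of_stream))
            if level == hit_level:  # data arrived at this level; stop moving it
                return placements
        placements.append(end_of_stream)  # missed L3 too: guess it goes to the end
        return placements

    # Load issued at cycle 100; the data finally comes back from L3 (hit_level=2):
    print(schedule_dependent({}, load_issue=100, hit_level=2, end_of_stream=400))
    # -> [104, 118, 138]: booked for the L1 timing, replayed for L2, again for L3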

7. Findecanor ◴[] No.45082621{3}[source]
In the last post on their forum, a month ago, they claimed that they were alive and making progress, but I dunno ...

What I'm afraid of is that they may have been shifting their goal a little too often, which of course would delay their time to market.

For example, I think they have shifted from straightforward fixed-width SIMD to scalable vectors of some sort, and the last I heard they were talking about AI ... which usually means there's some kind of support for matrix multiplication.