So you already know, just from the instructions, which code can be statically scheduled.
1) load some model and set the system into “ready” mode
2) wait for an event from a sensor
3) when the event occurs, trigger some response
4) do other stuff: bookkeeping, update the model, etc.
5) reset the system to “ready” mode and goto 2
Isn't it plausible we'd want hard time bounds on steps 2 and 3, but be fine with steps 1, 4, and 5 taking however long? (Assuming the device can be inactive while it is getting ready.) Then we could make sure steps 2 and 3 don't include any non-static instructions (rough sketch below).
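A minimal sketch of that split, assuming a memory-mapped sensor (the register addresses and helper names are made up for illustration): the hot path for steps 2–3 is just a register poll and a store, so its worst case is a fixed instruction count, while everything else is ordinary unconstrained code.

```c
#include <stdint.h>

/* Hypothetical MMIO registers; stand-ins for a real device memory map. */
#define SENSOR_EVENT     (*(volatile uint32_t *)0x40000000u)
#define RESPONSE_TRIGGER (*(volatile uint32_t *)0x40000004u)

static void load_model(void)   { /* step 1: slow path, no deadline */ }
static void update_model(void) { /* step 4: bookkeeping, no deadline */ }
static void set_ready(void)    { /* steps 1/5: arm the device */ }

void control_loop(void) {
    for (;;) {
        load_model();
        set_ready();
        /* Steps 2-3: the only code under a hard bound. A register poll
           and one store -- no allocation, no data-dependent loads -- so
           the response path is a fixed, statically schedulable sequence. */
        while (SENSOR_EVENT == 0) { /* spin */ }
        RESPONSE_TRIGGER = 1;
        /* Steps 4-5: back to "however long is fine". */
        update_model();
    }
}
```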
I think stalling is used for rarer, more awkward things like changing privilege modes or writing certain CSRs (e.g. satp), where you don't want to have to maintain speculative state.
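For concreteness, a sketch of the kind of sequence meant here, for RV64 with Sv39 paging (field layout per the RISC-V privileged spec; the helper name is made up): the satp write plus fence is a natural point for a core to just drain the pipeline rather than track speculative translations.

```c
#include <stdint.h>

/* Hypothetical helper: install a new root page table on RV64 (Sv39).
   satp layout: MODE[63:60] (Sv39 = 8), ASID[59:44], PPN[43:0]. Cores
   commonly serialize around the csrw/sfence.vma pair instead of
   speculating past them. */
static inline void switch_address_space(uint64_t root_ppn, uint64_t asid) {
    uint64_t satp = (8ULL << 60) | (asid << 44) | root_ppn;
    __asm__ volatile ("csrw satp, %0" :: "r"(satp) : "memory");
    __asm__ volatile ("sfence.vma zero, zero" ::: "memory"); /* flush TLB */
}
```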
Assuming a parallelizable workload, which is often not the case.
Whether the market segment (one that could utilize that much special sauce effectively enough to be worth the software engineering) would be large enough to warrant the hardware design and the bespoke silicon such a project entails is another question...
I'd probably favor spending the silicon on scatter/gather, or maybe on some way to span a large gap between calculating an address and using the value fetched from that address, so a prefetch wouldn't need to re-calculate the address (expensive) or tie up a GPR holding it (a precious resource). It could also let load atomicity happen anytime between providing the address (the prefetch request) and completing the load (writing the destination data register). A sketch of the software pattern this would subsume follows.
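Roughly the pattern such a mechanism would replace, written with today's tools (the function and names are made up; `__builtin_prefetch` is the GCC/Clang builtin): the address is computed once and prefetched, then has to stay live in a register, or be recomputed, until the actual load.

```c
#include <stddef.h>

/* Hypothetical gather-style reduction. Spanning the gap between address
   generation and use currently means pinning the pointer in a GPR
   across the "other work" (or recomputing it at the load). */
float sum_indexed(const float *base, const int *idx, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float *p = &base[idx[i]];  /* address calculation */
        __builtin_prefetch(p, 0, 1);     /* read hint, low temporal locality */
        /* ... unrelated work would sit here to hide the miss, with p
           occupying a register the whole time ... */
        acc += *p;                       /* the actual load */
    }
    return acc;
}
```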
Prior art: Nvidia's relatively recent async memcpy (iirc it came with Ampere/A100) directly from global to "shared" memory (the user-managed partition of the L1 data cache), bypassing the register file.
(There's decent third-party documentation from Nervana Systems, from when they squeezed all they could out of f32 dense matrix multiply, at the time substantially faster than Nvidia's cuBLAS library; none of this is exclusive to that architecture, though.)