So you already know, just from the instructions, which code can be statically scheduled.
1) load some model and set the system into “ready” mode
2) wait for an event from a sensor
3) when the event occurs, trigger some response
4) do other stuff: bookkeeping, update the model, etc.
5) reset the system to “ready” mode and goto 2
Isn't it plausible we'd want hard time bounds on steps 2 and 3, but be fine with steps 1, 4, and 5 taking however long? (Assuming the device can be inactive while it is getting ready.) Then we could make sure steps 2 and 3 don't include any non-static instructions (rough sketch below).
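A minimal sketch of that split, assuming a memory-mapped sensor (the register addresses and helper names are made up for illustration): the hot path for steps 2–3 is just a register poll and a store, so its worst case is a fixed instruction count, while everything else is ordinary unconstrained code.

```c
#include <stdint.h>

/* Hypothetical MMIO registers; stand-ins for a real device memory map. */
#define SENSOR_EVENT     (*(volatile uint32_t *)0x40000000u)
#define RESPONSE_TRIGGER (*(volatile uint32_t *)0x40000004u)

static void load_model(void)   { /* step 1: slow path, no deadline */ }
static void update_model(void) { /* step 4: bookkeeping, no deadline */ }
static void set_ready(void)    { /* steps 1/5: arm the device */ }

void control_loop(void) {
    for (;;) {
        load_model();
        set_ready();
        /* Steps 2-3: the only code under a hard bound. A register poll
           and one store -- no allocation, no data-dependent loads -- so
           the response path is a fixed, statically schedulable sequence. */
        while (SENSOR_EVENT == 0) { /* spin */ }
        RESPONSE_TRIGGER = 1;
        /* Steps 4-5: back to "however long is fine". */
        update_model();
    }
}
```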
I think stalling is used for rarer, more awkward things like changing privilege modes or writing certain CSRs (e.g. satp), where you don't want to have to maintain speculative state.
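For concreteness, a sketch of the kind of sequence meant here, for RV64 with Sv39 paging (field layout per the RISC-V privileged spec; the helper name is made up): the satp write plus fence is a natural point for a core to just drain the pipeline rather than track speculative translations.

```c
#include <stdint.h>

/* Hypothetical helper: install a new root page table on RV64 (Sv39).
   satp layout: MODE[63:60] (Sv39 = 8), ASID[59:44], PPN[43:0]. Cores
   commonly serialize around the csrw/sfence.vma pair instead of
   speculating past them. */
static inline void switch_address_space(uint64_t root_ppn, uint64_t asid) {
    uint64_t satp = (8ULL << 60) | (asid << 44) | root_ppn;
    __asm__ volatile ("csrw satp, %0" :: "r"(satp) : "memory");
    __asm__ volatile ("sfence.vma zero, zero" ::: "memory"); /* flush TLB */
}
```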
Assuming a parallelizable workload, which is often not the case.
Whether the market segment (one that could utilize that much special sauce effectively enough to be worth the software engineering) would be large enough to warrant the hardware design and the bespoke silicon such a project entails is another question...
I'd probably favor spending the silicon on scatter/gather, or maybe on some way to span a large gap between calculating an address and using the value fetched from that address, so a prefetch wouldn't need to re-calculate the address (expensive) or tie up a GPR holding it (a precious resource). It could also let load atomicity happen anytime between providing the address (the prefetch request) and completing the load (writing the destination data register). A sketch of the software pattern this would subsume follows.
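Roughly the pattern such a mechanism would replace, written with today's tools (the function and names are made up; `__builtin_prefetch` is the GCC/Clang builtin): the address is computed once and prefetched, then has to stay live in a register, or be recomputed, until the actual load.

```c
#include <stddef.h>

/* Hypothetical gather-style reduction. Spanning the gap between address
   generation and use currently means pinning the pointer in a GPR
   across the "other work" (or recomputing it at the load). */
float sum_indexed(const float *base, const int *idx, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float *p = &base[idx[i]];  /* address calculation */
        __builtin_prefetch(p, 0, 1);     /* read hint, low temporal locality */
        /* ... unrelated work would sit here to hide the miss, with p
           occupying a register the whole time ... */
        acc += *p;                       /* the actual load */
    }
    return acc;
}
```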
Prior art: Nvidia's relatively recent async memcpy (iirc it came with Ampere/A100) directly from global to "shared" memory (the user-managed partition of the L1 data cache), bypassing the register file.
(There's decent third-party documentation from Nervana Systems, from when they squeezed all they could out of f32 dense matrix multiply, at the time substantially faster than Nvidia's cuBLAS library; none of this is exclusive to that architecture, though.)