> Very unlikely to me that this design would have comparable "raw" performance to a design that implements something closer to tomasulo's algorithm.
The point appears to be trading maybe a few percent (5%-7%) of performance in exchange for saving tens of percent of energy consumption.
> The assumption that the latency of a load will be a l1 hit is a load bearing abstraction
That's just the initial assumption: that load results will appear 4 cycles after the load is started. If it gets to that +4 time and the result has not appeared, then it looks for an empty execution slot starting another 14 cycles along (+18 in total) for a possible L2 cache return.
The original slot result is marked as "poison", so if any instruction is reached that depends on it, that instruction will also be moved further along the RoB and its original slot marked as poison, and so on.
If the dependent instruction was originally at least 18 cycles along from the load being issued then I think it will just pick up the result of the L2 hit (if that happens), and not have to be moved.
If L2 also misses and the result still has not been returned when you get to the moved instruction, a spare execution slot will again be searched for starting at another 20 cycles along (+38 in total), in preparation for an L3 hit.
The article says that when searching for an empty slot only a maximum of 8 cycles' worth of slots are searched. It doesn't say what happens if there are no empty execution slots within that 8-cycle window. I suspect the instruction just gets moved right to the end, where all slots are empty.
It also doesn't say what happens if the load doesn't hit in L3. As main memory latency is under the control of the SoC vendor and/or the main board or system integrator (for sure not Condor), I suspect that L3 misses are also moved to the very end.
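To make the replay ladder concrete, here's a toy Python sketch of how I understand the scheme from the article. All names, the exact latencies, and the "move to the end" fallback are my own assumptions, not Condor's actual implementation:

```python
# Toy model of the load-replay ladder: a load's consumer is first
# scheduled assuming an L1 hit (+4 cycles after issue); on a miss
# its slot is poisoned and it is re-slotted at the next cache
# level's latency (+18 for L2, +38 for L3), searching at most
# 8 cycles' worth of slots and otherwise (my guess) falling back
# to the end of the scheduling window.

L1_LAT, L2_LAT, L3_LAT = 4, 18, 38   # cycles after the load issues
SEARCH_WINDOW = 8                    # cycles of slots searched per retry

def find_slot(busy, start, end):
    """First free cycle in [start, start + SEARCH_WINDOW),
    else fall back to `end` (assumed: tail slots are all empty)."""
    for t in range(start, start + SEARCH_WINDOW):
        if t not in busy:
            return t
    return end

def schedule_consumer(issue_cycle, hit_level, busy, end):
    """Cycle at which a consumer of the load finally executes,
    given which cache level (1, 2, or 3) the load hits in."""
    ladder = {2: L2_LAT, 3: L3_LAT}
    slot = issue_cycle + L1_LAT       # optimistic L1-hit slot
    for level in (2, 3):
        if hit_level < level:         # result already returned
            break
        # slot is "poisoned": re-slot for the next level's return
        slot = find_slot(busy, issue_cycle + ladder[level], end)
    return slot

# An L1 hit executes at +4, an L2 hit gets re-slotted to +18,
# an L3 hit to +38; if the 8-cycle window at +18 is fully busy,
# the consumer falls to the end of the window.
print(schedule_consumer(0, 1, set(), end=100))
print(schedule_consumer(0, 3, set(), end=100))
```

Note that, as described in the article, a dependent instruction originally sitting 18+ cycles past the load never needs to move for an L2 hit; in this sketch that corresponds to its slot already being at or beyond the re-slotted cycle.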