GlobalFoundries to Acquire MIPS

It seemed like a good idea in 01981; the purported expansion of MIPS was "Microprocessor without Interlocked Pipeline Stages", although of course it's a pun on "millions of instructions per second". By just omitting the interlock logic necessary to detect branch hazards and putting the responsibility on the compiler, you get a chip that can run faster with less transistors. IBM's 45000-transistor 32-bit RISC "ROMP" was fabbed for use in IBM products that year, which gives you an idea of how precious silicon area was at the time.

Stanford MIPS was extremely influential, which was undoubtedly a major factor in many RISC architectures copying the delay-slot feature, including SPARC, the PA-RISC, and the i860. But the delay slot really only simplifies a particular narrow range of microarchitectures, those with almost exactly the same pipeline structure as the original. If you want to lengthen the pipeline, either you have to add the interlocks back in, or you have to add extra delay slots, breaking binary compatibility. So delay slots fell out of favor fairly quickly in the 80s. Maybe they were never a good tradeoff.

One of the main things pushing people to RISC in the 80s was virtual memory, specifically, the necessity of being able to restart a faulted instruction after a page fault. (See Mashey's masterful explanation of why this doomed the VAX in https://yarchive.net/comp/vax.html.) RISC architectures generally didn't have multiple memory accesses or multiple writes per instruction (ARM being a notable exception), so all the information you needed to restart the failed instruction successfully was in the saved program counter.

But delay slots pose a problem here! Suppose the faulting instruction is the delay-slot instruction following a branch. The next instruction to execute after resuming that one could either be the instruction that was branched to, or the instruction at the address after the delay-slot instruction, depending on whether the branch was taken or not. That means you need to either take the fault before the branch, or the fault handler needs to save at least the branch-taken bit. I've never programmed a page-fault handler for MIPS, the SPARC, PA-RISC, or the i860, so I don't know how they handle this, but it seems like it implies extra implementation complexity of precisely the kind Hennessy was trying to weasel out of.

The WP page also mentions that MIPS had load delay slots, where the datum you loaded wasn't available in the very next instruction. I'm reminded that the Tera MTA actually had a variable number of load delay slots, specified in a field in the load instruction, to allow the compiler to allow as many instructions as it could for the memory reference to come back from RAM over the packet-switching network. (The CPU would then stall your thread if the load took longer than the allotted number of instructions, but the idea was that a compiler that prefetched enough stuff into your thread's huge register set could make such stalls very rare.)