224 points mshockwave | 26 comments
1. sloemoe ◴[] No.44502573[source]
Put that in your delay slot and smoke it.

https://en.wikipedia.org/wiki/Delay_slot

I'm surprised by how many other architectures use it.

replies(4): >>44502951 #>>44503609 #>>44503931 #>>44504952 #
2. jnwatson ◴[] No.44502951[source]
The TI C40 used them.
3. kragen ◴[] No.44503609[source]
It seemed like a good idea in 01981; the purported expansion of MIPS was "Microprocessor without Interlocked Pipeline Stages", although of course it's a pun on "millions of instructions per second". By just omitting the interlock logic necessary to detect branch hazards and putting the responsibility on the compiler, you get a chip that can run faster with fewer transistors. IBM's 45000-transistor 32-bit RISC "ROMP" was fabbed for use in IBM products that year, which gives you an idea of how precious silicon area was at the time.

Stanford MIPS was extremely influential, which was undoubtedly a major factor in many RISC architectures copying the delay-slot feature, including SPARC, the PA-RISC, and the i860. But the delay slot really only simplifies a particular narrow range of microarchitectures, those with almost exactly the same pipeline structure as the original. If you want to lengthen the pipeline, either you have to add the interlocks back in, or you have to add extra delay slots, breaking binary compatibility. So delay slots fell out of favor fairly quickly in the 80s. Maybe they were never a good tradeoff.

One of the main things pushing people to RISC in the 80s was virtual memory, specifically, the necessity of being able to restart a faulted instruction after a page fault. (See Mashey's masterful explanation of why this doomed the VAX in https://yarchive.net/comp/vax.html.) RISC architectures generally didn't have multiple memory accesses or multiple writes per instruction (ARM being a notable exception), so all the information you needed to restart the failed instruction successfully was in the saved program counter.

But delay slots pose a problem here! Suppose the faulting instruction is the delay-slot instruction following a branch. The next instruction to execute after resuming that one could either be the instruction that was branched to, or the instruction at the address after the delay-slot instruction, depending on whether the branch was taken or not. That means you need to either take the fault before the branch, or the fault handler needs to save at least the branch-taken bit. I've never programmed a page-fault handler for MIPS, the SPARC, PA-RISC, or the i860, so I don't know how they handle this, but it seems like it implies extra implementation complexity of precisely the kind Hennessy was trying to weasel out of.
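
The ambiguity can be sketched in a few lines. This is a toy model only, assuming fixed 4-byte instructions and a single branch delay slot; the function name and addresses are made up for illustration.

```python
def next_pc_after_delay_slot(slot_pc, branch_taken, branch_target):
    """Where execution continues once the delay-slot instruction at slot_pc completes.

    Toy model: fixed 4-byte instructions, exactly one branch delay slot.
    """
    return branch_target if branch_taken else slot_pc + 4


slot_pc = 0x1004   # hypothetical delay-slot instruction; its branch sits at 0x1000
target = 0x2000    # hypothetical branch target

# A saved PC of 0x1004 by itself does not say which of these two applies.
if_taken = next_pc_after_delay_slot(slot_pc, True, target)
if_not_taken = next_pc_after_delay_slot(slot_pc, False, target)
print(hex(if_taken), hex(if_not_taken))
```

The saved program counter is the same in both cases, so some extra state (or a different choice of restart address) is needed to resume correctly.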

The WP page also mentions that MIPS had load delay slots, where the datum you loaded wasn't available in the very next instruction. I'm reminded that the Tera MTA actually had a variable number of load delay slots, specified in a field in the load instruction, to allow the compiler to allow as many instructions as it could for the memory reference to come back from RAM over the packet-switching network. (The CPU would then stall your thread if the load took longer than the allotted number of instructions, but the idea was that a compiler that prefetched enough stuff into your thread's huge register set could make such stalls very rare.)
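
A rough way to picture the variable load delay described above, assuming (purely for illustration) one instruction per cycle and a hypothetical `lookahead` field counting the instructions allowed to run before the loaded value is needed:

```python
def stall_cycles(load_latency, lookahead):
    """Cycles the thread stalls when it first needs the loaded value.

    Toy model: one instruction per cycle; `lookahead` is the number of
    instructions the compiler promised do not depend on the load.
    """
    return max(0, load_latency - lookahead)


# A load that returns within the promised window costs nothing extra;
# a slower one stalls the thread for the difference.
print(stall_cycles(5, 7), stall_cycles(70, 7))
```

The real encoding and latencies on the Tera MTA are not given in this thread, so the numbers here are placeholders.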

replies(2): >>44503906 #>>44507204 #
4. garaetjjte ◴[] No.44503906[source]
I think the program counter is backed up and the branch is just re-executed. It's annoying, though, if the handler wants to skip over the faulting instruction (e.g. it was a syscall), as it then needs to emulate the branch behavior in software. Most of the complexity is punted to software; I think the only hardware tweaks needed are keeping an in-delay-slot flag in the fault description, and keeping the address of the currently executing instruction for fault reporting and PC-relative addressing (which probably could be omitted otherwise, as keeping only the next instruction's address would be enough).
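
A minimal sketch of that scheme, assuming fixed 4-byte instructions. `FaultInfo`, `restart_pc`, and `skip_pc` are hypothetical names for illustration, not any real kernel's API:

```python
from dataclasses import dataclass


@dataclass
class FaultInfo:
    faulting_pc: int      # address of the instruction that faulted
    in_delay_slot: bool   # hypothetical flag recorded by the hardware


def restart_pc(fault):
    """Address to resume at so that the faulting instruction is retried."""
    # In a delay slot: back up one instruction and re-execute the branch too.
    return fault.faulting_pc - 4 if fault.in_delay_slot else fault.faulting_pc


def skip_pc(fault, branch_taken, branch_target):
    """Address to resume at when the handler wants to skip the faulting instruction."""
    if not fault.in_delay_slot:
        return fault.faulting_pc + 4
    # The branch cannot simply be re-run, because that would re-run the slot
    # as well, so software has to work out where the branch would have gone.
    return branch_target if branch_taken else fault.faulting_pc + 4


fault = FaultInfo(faulting_pc=0x1004, in_delay_slot=True)
print(hex(restart_pc(fault)), hex(skip_pc(fault, True, 0x2000)))
```

In the skip case the caller has to supply `branch_taken` and `branch_target` itself, which is the "emulate the branch in software" cost mentioned above.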
replies(1): >>44504052 #
5. vesinisa ◴[] No.44503931[source]
Whoa, had no idea this existed. Wild stuff. Might be "somewhat" confusing to read assembler code like that without knowing about this particular technique.
replies(2): >>44504037 #>>44508846 #
6. chasil ◴[] No.44504037[source]
Allow me to introduce you to register windows.

https://www.jwhitham.org/2016/02/risc-instruction-sets-i-hav...

replies(2): >>44504985 #>>44505013 #
7. kragen ◴[] No.44504052{3}[source]
Thank you! I guess that, as long as the branch instruction itself can't modify any of the state that would cause it to branch or not, that's a perfectly valid solution. It seems like load delay slots would be more troublesome; I wonder how the MIPS R2000 and R3000 handled that? (I'm not sure the Tera supported virtual memory.)
replies(1): >>44504620 #
8. garaetjjte ◴[] No.44504620{4}[source]
Load delay slots don't seem to need special fault handling support; you're not supposed to depend on the old value being there in the delay slot.

One more thing about branch delay slots: it seems the original SuperH went for a very minimal solution. It prevents interrupts from being taken between a branch and its delay slot, and not much else. PC-relative accesses are relative to the branch target, and faults are also reported with the branch target address. As far as I can see, this makes faults in branch delay slots unrecoverable. In SH-3 they patched that by reporting faults in delay slots of taken branches with the address of the branch itself, so things can be fixed up in the fault handler.

replies(1): >>44504690 #
9. kragen ◴[] No.44504690{5}[source]
Hmm, I guess that if the load instruction doesn't change anything except the destination register (unlike, for example, postincrement addressing modes) and the delay-slot instruction also can't do anything that would change the effective address being loaded from before it faulted (and can't depend on the old value), then you're right that it wouldn't need any special fault handling support. I'd never tried to think this through before, but it makes sense. I appreciate it.

As for SH2, ouch! So SH2 got pretty badly screwed by delay slots, eh?

replies(1): >>44505036 #
10. burnt-resistor ◴[] No.44504952[source]
SPIM says "all shall be efficient single cycle instructions and to heck with the MHz wars!" /s
11. apaprocki ◴[] No.44504985{3}[source]
Both register windows and the delay slot exist on SPARC processors, which you’re much more likely to run into in a data center (running open-source software).

Itanium was the really odd one — it not only used register windows but could offload some of the prior windows onto the heap. Most people would probably never notice… unless you’re trying to get a conservative scanning GC working and are stumped why values in some registers seem to not be traced…

replies(1): >>44505025 #
12. burnt-resistor ◴[] No.44505013{3}[source]
I was going to make a reference to Patterson & Hennessy, but it's too bad that the 5th and later editions are hidden behind a DRM paywall. You don't "own" books anymore.
13. burnt-resistor ◴[] No.44505025{4}[source]
Pour one out for Itanium. It tried to make the panacea of VLIW and branch hints work, but it didn't pan out.
replies(2): >>44505810 #>>44507100 #
14. garaetjjte ◴[] No.44505036{6}[source]
Even without faults, some SO answers indicate that on the R2000 the new value might be available in the delay slot if it was a cache miss.

As for SuperH, I don't think they cared too much. The primary use of fault handling is memory paging, and an MMU was added only in SH-3, so that's probably why they also fixed delay slot fault recovery then. Before that, faults were either illegal opcodes or alignment violations, and the answer for those was probably "don't do that".

replies(1): >>44505102 #
15. kragen ◴[] No.44505102{7}[source]
The new value was available earlier if it was a cache miss?

I didn't remember that the SH2 didn't support virtual memory (perhaps because I've never used SuperH). That makes sense, then.

I think that, for the ways people most commonly use CPUs, it's acceptable if the value you read from a register in a load delay slot is nondeterministic, for example depending on whether you resumed from a page fault or not, or whether you had a cache miss or not. It could really impede debugging if it happened in practice, and it could impede reverse-engineering of malware, but I believe that such things are actually relatively common. (IIRC you could detect the difference between an 8086 and an 8088 by modifying the next instruction in the program, which would have been already loaded by the 8086 but not the 8088. But I'm guessing that under a single-stepping debugger the 8086 would act like an 8088 in this case.) The solution would probably be "Stop lifting your arm like that if it hurts;" it's easy enough to not emit the offending instruction sequences from your compiler in this case.

The case where people really worry about nondeterminism is where it exposes information in a security-violating way, as in Spectre, which isn't even nondeterminism at the register-contents level, just the timing level.

Myself, I have a strong preference for strongly deterministic CPU semantics, and I've been working on a portable strongly deterministic (but not for timing) virtual machine for archival purposes. But clearly strong determinism isn't required for a usable CPU.

replies(1): >>44505427 #
16. garaetjjte ◴[] No.44505427{8}[source]
>The new value was available earlier if it was a cache miss?

Apparently so. Maybe the logic is that it is available one instruction later if it's a hit, but when it's a miss it stalls the entire pipeline anyway, and resumes only when the result is available.
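
That guess can be written down as a toy model. It encodes only the behavior speculated about here and is not checked against R2000 documentation; the function name is made up:

```python
def value_seen_in_load_delay_slot(old_value, loaded_value, cache_hit):
    """What the instruction right after a load reads from the destination register."""
    reg = old_value
    if not cache_hit:
        # Miss: the whole pipeline stalls until the data arrives, so the
        # register is already updated when the delay-slot instruction runs.
        reg = loaded_value
    seen = reg
    # Hit: the write lands only after the delay-slot instruction has read.
    reg = loaded_value
    return seen


print(value_seen_in_load_delay_slot(111, 222, cache_hit=True))
print(value_seen_in_load_delay_slot(111, 222, cache_hit=False))
```

Under this model the delay-slot read depends on cache state, which is why code is not supposed to rely on the old value.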

One source of non-determinism that stayed for a long time in various architectures was LL/SC linked atomics. It mostly didn't matter, but e.g. the rr recording debugger on AArch64 doesn't work on applications using these instead of the newer CAS extension atomics.
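
For readers who haven't met LL/SC, here is a single-threaded toy model of the semantics, assuming one reservation that any store to the same address clears. `Memory` and `atomic_increment` are illustrative names; real hardware may also drop the reservation for reasons invisible to the program, which is the non-deterministic part:

```python
class Memory:
    def __init__(self):
        self.cells = {}
        self.reservation = None  # address currently reserved, if any

    def load_linked(self, addr):
        self.reservation = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        # An ordinary store from "someone else" breaks a matching reservation.
        self.cells[addr] = value
        if self.reservation == addr:
            self.reservation = None

    def store_conditional(self, addr, value):
        ok = self.reservation == addr
        if ok:
            self.cells[addr] = value
        self.reservation = None
        return ok


def atomic_increment(mem, addr, interfere=None):
    """Retry loop; returns how many attempts were needed."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_linked(addr)
        if interfere is not None and attempts == 1:
            interfere()  # simulate another agent writing between LL and SC
        if mem.store_conditional(addr, old + 1):
            return attempts


mem = Memory()
print(atomic_increment(mem, 0x10))
print(atomic_increment(mem, 0x10, interfere=lambda: mem.store(0x10, 100)))
```

The number of loop iterations depends on events outside the program's control, so a replay of the same instruction stream can take a different path through the loop.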

replies(1): >>44505474 #
17. kragen ◴[] No.44505474{9}[source]
Oh, that makes sense.

WRT LL/SC, I don't think it's dead yet—isn't RISC-V's A extension using a sort of LL/SC? rr is indeed exactly the kind of collateral damage that I deplore. rr is amazing.

18. chasil ◴[] No.44505810{5}[source]
From an interview with Bob Colwell:

'Anyway this chip architect guy is standing up in front of this group promising the moon and stars. And I finally put my hand up and said I just could not see how you're proposing to get to those kind of performance levels. And he said well we've got a simulation, and I thought Ah, ok. That shut me up for a little bit, but then something occurred to me and I interrupted him again. I said, wait I am sorry to derail this meeting. But how would you use a simulator if you don't have a compiler? He said, well that's true we don't have a compiler yet, so I hand assembled my simulations. I asked "How did you do thousands of line of code that way?" He said “No, I did 30 lines of code”. Flabbergasted, I said, "You're predicting the entire future of this architecture on 30 lines of hand generated code?" [chuckle], I said it just like that, I did not mean to be insulting but I was just thunderstruck. Andy Grove piped up and said "we are not here right now to reconsider the future of this effort, so let’s move on".'

https://www.sigmicro.org/media/oralhistories/colwell.pdf

replies(1): >>44508005 #
19. chithanh ◴[] No.44507100{5}[source]
VLIW is maybe cool, but people will be relieving themselves on EPIC's grave for the pain that it inflicted on them.

Like if you tried to debug a software crash on Itanium. The customer-provided core dump was useless, as you could not see what was going on. Intel added a debug mode to their compilers which disabled all that EPIC so hopefully you could reproduce the crash there, or on other CPU architectures. Otherwise you were basically screwed.

replies(1): >>44507607 #
20. musicale ◴[] No.44507204[source]
Or "make interlocks programmed in software". But later MIPS versions had hardware interlocks I believe.
replies(1): >>44508832 #
21. burnt-resistor ◴[] No.44507607{6}[source]
EPIC :nauseous face emoji:

That HP-Intel arrangement was weird. One time, an Intel-badged employee came out to change a tape drive on a (Compaq->HP->HPE) Compaq SSL2020 tape robot. Okay, I guess they shared employees. ¯\_(ツ)_/¯

22. panick21_ ◴[] No.44508005{6}[source]
Sun had some funny stories around this too. When they came up with their multi-core system, they used code from 10-15 years earlier for traces, and then said 'well, nobody actually uses floating point code', so we don't need it. Of course, over those 10 years floating point became much more common and standard, leading to a chip that had one FPU for 8 cores, which basically meant that even minimal floating point use would destroy concurrency. Arguably Sun had already lost the chip war and this was just making them fall behind further. They did market it quite well.

And a lesser-known thing that I couldn't find much information on is that Sun also worked on a VLIW chip during the 90s. Apparently Bill Joy was convinced that VLIW was the future, so they did a VLIW chip, and the project was led by David Ditzel. As far as I am aware, this was never released. If any Sun veterans have any idea about this, I would love to know.

replies(1): >>44508191 #
23. chasil ◴[] No.44508191{7}[source]
As far as the single FPU that you mention, the T1 is an open-source CPU.

https://www.oracle.com/servers/technologies/opensparc-t1-pag...

The T2 is also open, and places an FPU in each core.

https://www.oracle.com/servers/technologies/opensparc-t2-pag...

When there is so much complaint about closed firmware in the Raspberry Pi, and the risk of the Intel ME and other closed CPU features, I wonder why these open designs are ignored. Yes, the performance and power consumption would be poor by modern standards.

replies(1): >>44508212 #
24. panick21_ ◴[] No.44508212{8}[source]
These designs are not ignored. They were used for a few things here and there. But the usefulness of 'over the wall' open code without backing is always a bit limited, and for processors that cost 100k to tape out, even more so.

By now there are much better, more modern designs out there, including for RISC-V.

25. bobmcnamara ◴[] No.44508832{3}[source]
Some kernel somewhere:

Switch(mipsarch): Case 1: Nop.

Case 2: Noop.

Case 10: Noooooooooop.

26. bobmcnamara ◴[] No.44508846[source]
Many assemblers had an option to reorder on assembly so you could write it normally, while only taking care to avoid hazards near branches.

At least one toolchain would just pad the slots with nops.
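
Both strategies fit in one small pass. This is a sketch under loud assumptions: straight-line code with no labels, a made-up `Insn` tuple, a hypothetical set of branch mnemonics, and branches that write no registers. It is not how any particular assembler does it:

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    op: str
    dest: Optional[str] = None   # register written, if any
    srcs: tuple = ()             # registers read


NOP = Insn("nop")
BRANCHES = {"beq", "bne", "j"}   # hypothetical subset of branch mnemonics


def fill_delay_slots(program):
    """Rewrite 'naturally ordered' code so every branch is followed by its slot."""
    out = []
    last_slot = -1  # index in `out` of the most recent delay-slot instruction
    for insn in program:
        if insn.op not in BRANCHES:
            out.append(insn)
            continue
        prev_idx = len(out) - 1
        can_hoist = (
            prev_idx >= 0
            and prev_idx != last_slot  # never steal an earlier branch's slot
            and (out[prev_idx].dest is None or out[prev_idx].dest not in insn.srcs)
        )
        # Either move the preceding instruction into the slot, or pad with a nop.
        slot = out.pop() if can_hoist else NOP
        out.append(insn)
        out.append(slot)
        last_slot = len(out) - 1
    return out


independent = [Insn("add", "t0", ("t1", "t2")), Insn("beq", None, ("t3", "t4"))]
dependent = [Insn("add", "t0", ("t1", "t2")), Insn("beq", None, ("t0", "t4"))]
print([i.op for i in fill_delay_slots(independent)])
print([i.op for i in fill_delay_slots(dependent)])
```

When the preceding instruction feeds the branch condition it has to stay put, and the slot gets a nop, which is the "just pad" behavior as the fallback.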