184 points onename | 13 comments
gdiamos ◴[] No.45898849[source]
Transmeta made a technology bet that dynamic compilation could beat out-of-order (OoO) superscalar CPUs in SPEC.

It was wrong, but it was controversial among experts at the time.

I’m glad that they tried it even though it turned out to be wrong. Many of the lessons learned are documented in systems conferences and incorporated into modern designs, e.g. GPUs.

To me, Transmeta is a great example of a venture investment. If it had beaten Intel at SPEC by a margin, it would have dominated the market. Sometimes the only way to get to the bottom of a complex system is to build it.

The same could be said of scaling laws and LLMs. It was theory before Dario, Ilya, OpenAI, et al. trained it.

replies(7): >>45898875 #>>45899126 #>>45899335 #>>45901599 #>>45902119 #>>45903852 #>>45906222 #
1. pshirshov ◴[] No.45899126[source]
Aren't modern CPUs, essentially, dynamic translators from the x86_64 instruction set into internal RISC-like instruction sets?
replies(3): >>45899325 #>>45901534 #>>45901535 #
2. p_l ◴[] No.45899325[source]
Not to the same level. Crusoe was, in many ways, more classic CISC than x86, except its microcode was actually doing dynamic translation to an internal ISA instead of operating like an interpreter as in old CISCs.

The x86 ISA had the funny advantage of being way closer to RISC than the "beloved" CISC architectures of old like m68k or VAX. Many common instructions translate to single "RISCy" instruction for the internal microarchitecture (something AMD noted IIRC in the original K5 with its AMD29050-derived core as "most instructions translate to 1 internal microinstruction, some between 2 to 4"). x86 prefixes are also way simpler than the complicated logic of decoding m68k or VAX. An instruction with multiple prefixes will quite probably decode to a single microinstruction.

That said, there's a funny thing in that Transmeta tech survived for quite a long time, to the point that there were Android tablets, in fact flagship Google ones like the Nexus 9, whose CPU was based on it, because the NVIDIA "Denver" architecture used the same technology (AFAIK licensed from Transmeta, but don't cite me on this).

replies(2): >>45899500 #>>45901247 #
3. mananaysiempre ◴[] No.45899500[source]
> Many common [x86] instructions translate to single "RISCy" instruction for the internal microarchitecture

And then there are read-modify-write instructions, which on modern CPUs need two address-generation μops in addition to the load one, the store one, and the ALU one. So the underlying load-store architecture is very visible.

There’s also the part where we’ve trained ourselves out of using the more CISCy parts of x86 like ENTER, BOUND, or even LOOP, because they’ve been slow for ages, and thus they stay slow.

replies(2): >>45899729 #>>45903282 #
4. p_l ◴[] No.45899729{3}[source]
Even many of the more complex instructions can often translate into surprisingly short sequences. All sorts of loop structures now have various kinds of optimizations, including instruction fusion, that probably would not be necessary if we hadn't stopped using higher-level LOOP constructs ;-)

But, for example, REP MOVS is now fused into the equivalent of using SSE load-stores (16 bytes) or even AVX-512 load-stores (64 bytes).

And of course the equivalent of LEA, i.e. address computation through the ModRM/SIB bytes, is pretty much free, since AFAIK it is handled as a pipeline step.

5. taolson ◴[] No.45901247[source]
>something AMD noted IIRC in the original K5 with its AMD29050-derived core

Just a small nitpick: I've seen the K5/29050 connection mentioned in a number of places, but the K5 was actually based upon an unreleased superscalar 29K project called "Jaguar", not the 29050, which was a single-issue, in-order design.

6. JoshTriplett ◴[] No.45901535[source]
Modern CPUs still translate individual instructions to corresponding micro-ops, and do a bit of optimization with adjacent micro-ops. Transmeta converted whole regions of code at a time, and I think it tried to do higher-level optimizations.
7. pizlonator ◴[] No.45901534[source]
Folks like to say that, but that's not what's happening.

The key difference is: what is an instruction set? Is it a Turing-complete thing with branches, calls, etc? Or is it just data flow instructions (math, compares, loads and stores, etc)?

X86 CPUs handle branching in the frontend using speculation. They predict where the branch will go, issue data flow instructions from that branch destination, along with a special "verify that I branched to the right place" instruction, which is basically just the compare portion of the branch. ARM CPUs do the same thing. In both X86 and ARM CPUs, the data flow instructions that the CPU actually executes look different (are lower level, have more registers) than the original instruction set.

This means that there is no need to translate branch destinations. There's never a place in the CPU that has to take a branch destination (an integer address in virtual memory) in your X86 instruction stream and work out what the corresponding branch destination is in the lower-level data flow stream. This is because the data flow stream doesn't branch; it only speculates.

On the other hand, a DBT has to have a story for translating branch destinations, and it does have to target a full instruction set that does have branching.
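
To make the "story for translating branch destinations" concrete, here is a minimal Python sketch of a translator's dispatch loop. Everything in it is an assumption made for illustration: the three-opcode guest ISA, the `GUEST` program, and the `translation_cache`, `translate_block`, and `run` names are hypothetical and are not taken from Transmeta or any real DBT. It only shows that a guest branch target (an integer guest address) has to be mapped to a translated block before execution can continue.

```python
# Hypothetical toy guest program: guest address -> (opcode, operand).
GUEST = {
    0: ("addi", 5),     # acc += 5
    1: ("jmp", 3),      # direct branch to guest address 3
    2: ("addi", 100),   # never reached
    3: ("addi", 1),     # acc += 1
    4: ("halt", None),
}

# Maps a guest entry address to its translated "host" block.
translation_cache = {}


def translate_block(entry):
    """Translate the basic block starting at a guest address into a closure."""
    increments = []
    pc = entry
    while True:
        opcode, operand = GUEST[pc]
        if opcode == "addi":
            increments.append(operand)
            pc += 1
        elif opcode == "jmp":
            next_pc = operand   # still a guest address; resolved by the dispatcher
            break
        elif opcode == "halt":
            next_pc = None
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

    def host_block(acc):
        for value in increments:
            acc += value
        return acc, next_pc

    return host_block


def run(entry=0):
    """Dispatch loop: look up (or create) the translation for each branch target."""
    acc, pc = 0, entry
    while pc is not None:
        block = translation_cache.get(pc)
        if block is None:               # first visit: translate and remember
            block = translate_block(pc)
            translation_cache[pc] = block
        acc, pc = block(acc)
    return acc
```

Calling `run()` translates two blocks, one entered at guest address 0 and one at guest address 3. A real translator would typically patch direct branches to jump straight to the translated target rather than returning to a dispatcher each time; this sketch skips that step.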

That said, I don't know what the Transmeta CPUs did. Maybe they had a low-level instruction set that had all sorts of hacks to help the translation layer avoid the problems of branch destination translation.

replies(1): >>45903210 #
8. monocasa ◴[] No.45903210[source]
> That said, I don't know what the Transmeta CPUs did. Maybe they had a low-level instruction set that had all sorts of hacks to help the translation layer avoid the problems of branch destination translation.

Fixed guest branches just get turned into host branches and work like normal.

Indirect guest branches would get translated through a hardware jump address cache that was structured kind of like TLB tag lookups are.
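
As a rough software model of a jump address cache "structured kind of like TLB tag lookups", here is a small direct-mapped cache in Python. The `JumpAddressCache` class, its sizes, and the `slow_path` callback are all hypothetical; the actual Transmeta hardware structure is not documented in this thread, so treat this as a sketch of the general tag-compare idea only.

```python
class JumpAddressCache:
    """Direct-mapped guest-address -> host-address cache with tag compare."""

    def __init__(self, entries=8):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.entries = entries
        self.tags = [None] * entries    # guest address stored in each slot
        self.hosts = [None] * entries   # translated host address for that slot
        self.hits = 0
        self.misses = 0

    def lookup(self, guest_addr, slow_path):
        index = guest_addr & (self.entries - 1)   # low bits select the slot
        if self.tags[index] == guest_addr:        # tag match: fast path
            self.hits += 1
            return self.hosts[index]
        # Miss: fall back to the slow lookup and refill the slot.
        self.misses += 1
        host_addr = slow_path(guest_addr)
        self.tags[index] = guest_addr
        self.hosts[index] = host_addr
        return host_addr
```

Two guest targets that share the same low bits evict each other, just as conflicting entries would in a direct-mapped TLB. The slow path stands in for whatever the translator does on a miss, for example a software hash-table lookup.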

replies(1): >>45903329 #
9. monocasa ◴[] No.45903282{3}[source]
There are levels of microcode.

It's not too uncommon for each pipeline stage or so to have its own uop format, as each stage computes what it was designed to and culls what later stages don't need.

Because of this, it's not that weird to see a single RMW uop at, say, the initial decode and microcode layer, which then gets cracked into different uops for the different functional units later on.

10. pizlonator ◴[] No.45903329{3}[source]
Thank you for sharing!

> Fixed guest branches just get turned into host branches and work like normal.

How does that work in case of self-modifying code, or skewed execution (where the same x86 instruction stream has two totally different interpretations based on what offset you start at)?

replies(1): >>45904427 #
11. monocasa ◴[] No.45904427{4}[source]
Skewed execution is just different traces. There is no requirement that basic blocks not partially overlap with other basic blocks. You want that anyway for optimization reasons, even without skewed execution.

Self-modifying code is handled with MMU traps on the writes and invalidation of the relevant traces. It is very much a slow path, though. Ideally, heavily self-modifying code is able to stay in the interpreter and not thrash in and out of the compiler.
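
A minimal Python sketch of that bookkeeping, covering both overlapping (skewed) traces and invalidation on writes. The `TraceCache` class, the 4 KiB `PAGE_SIZE`, and the method names are hypothetical, and the "trap" is modeled as an ordinary method call; real hardware would raise the fault from the MMU and may track protection at a different granularity.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class TraceCache:
    def __init__(self):
        self.traces = {}   # guest entry address -> (start, end) byte range
        self.pages = {}    # page number -> entry addresses with code on that page

    @staticmethod
    def _pages_of(start, end):
        return range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1)

    def add_trace(self, entry, start, end):
        """Record a translated trace covering guest bytes [start, end)."""
        self.traces[entry] = (start, end)
        for page in self._pages_of(start, end):
            self.pages.setdefault(page, set()).add(entry)

    def is_write_protected(self, addr):
        """A page is write-protected while any trace still covers it."""
        return addr // PAGE_SIZE in self.pages

    def on_guest_write(self, addr):
        """Model the write trap: drop every trace whose bytes include addr."""
        candidates = self.pages.get(addr // PAGE_SIZE, ())
        victims = sorted(
            entry for entry in candidates
            if self.traces[entry][0] <= addr < self.traces[entry][1]
        )
        for entry in victims:
            start, end = self.traces.pop(entry)
            for page in self._pages_of(start, end):
                self.pages[page].discard(entry)
                if not self.pages[page]:
                    del self.pages[page]   # no code left: protection can be lifted
        return victims
```

Traces are keyed by entry address, so two traces entered at 0x1000 and 0x1001 can cover the same bytes without conflict, and a single write into those bytes invalidates both.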

replies(1): >>45905474 #
12. pizlonator ◴[] No.45905474{5}[source]
> Self modifying code is handled with MMU traps on the writes, and invalidation of the relevant traces. It is very much a slow path though. Ideally heavy self modifying code is able to stay in the interpreter though and not thrash in and out of the compiler.

This might end up having a bad time running JavaScript VM JITed code, which self-modifies a lot.

But all of that makes sense! Thanks!

replies(1): >>45905778 #
13. monocasa ◴[] No.45905778{6}[source]
Yeah, nesting JITs was kind of always an Achilles heel of this kind of architecture.

IIRC, they had a research project to look at shipping a custom JVM that compiled straight to their internal ISA to skip the impedance mismatch between two JITs. JITed JS (or really any extremely dynamic code that also asks for high perf) probably wasn't even on their radar given the era, when even the Smalltalk VM that HotSpot derived from was built for a strongly typed derivative of Smalltalk.