184 points by onename | 39 comments
1. gdiamos ◴[] No.45898849[source]
Transmeta made a technology bet that dynamic compilation could beat out-of-order superscalar CPUs on SPEC.

It was wrong, but it was controversial among experts at the time.

I’m glad that they tried it even though it turned out to be wrong. Many of the lessons learned are documented in systems conferences and incorporated into modern designs, e.g. GPUs.

To me Transmeta is a great example of a venture investment. If it had beaten Intel on SPEC by a clear margin, it would have dominated the market. Sometimes the only way to get to the bottom of a complex system is to build it.

The same could be said of scaling laws and LLMs. It was theory before Dario, Ilya, OpenAI, et al trained it.

replies(7): >>45898875 #>>45899126 #>>45899335 #>>45901599 #>>45902119 #>>45903852 #>>45906222 #
2. vlovich123 ◴[] No.45898875[source]
I think of it more as the timing being wrong - betting on software in an era of exponential hardware growth was unwise (software performance can’t scale that way). The problem is that you need to marry it with a significantly better CPU/architecture, because the JIT is about not losing performance while retaining backwards compatibility.

However, if you add it onto a better CPU it’s a fine technique to bet on - case in point Apple’s move away from Intel onto homegrown CPUs.

replies(3): >>45902003 #>>45902290 #>>45905590 #
3. pshirshov ◴[] No.45899126[source]
Aren't modern CPUs, essentially, dynamic translators from the x86_64 instruction set into internal RISC-like instruction sets?
replies(3): >>45899325 #>>45901534 #>>45901535 #
4. p_l ◴[] No.45899325[source]
Not to the same level. Crusoe was, in many ways, more classic CISC than x86 - except its microcode was actually doing dynamic translation to an internal ISA instead of operating like the interpreters in old CISCs.

The x86 ISA had the funny advantage of being way closer to RISC than "beloved" CISC architectures of old like m68k or VAX. Many common instructions translate to a single "RISCy" instruction in the internal microarchitecture (something AMD noted IIRC in the original K5 with its AMD29050-derived core as "most instructions translate to 1 internal microinstruction, some between 2 to 4"). x86 prefixes are also way simpler than the complicated decode logic of m68k or VAX. An instruction with multiple prefixes will quite probably still decode to a single microinstruction.

That said, there's a funny thing in that Transmeta tech survived quite a long time, to the point that there were Android tablets, in fact flagship Google ones like the Nexus 9, whose CPU was based on it - because Nvidia's "Denver" architecture used the same technology (AFAIK licensed from Transmeta, but don't cite me on this).

replies(2): >>45899500 #>>45901247 #
5. rjsw ◴[] No.45899335[source]
They were also the first to produce an x86 CPU with an integrated northbridge; they could have pitched it more at embedded and industrial markets, where SPEC scores are less important.
replies(1): >>45902223 #
6. mananaysiempre ◴[] No.45899500{3}[source]
> Many common [x86] instructions translate to single "RISCy" instruction for the internal microarchitecture

And then there are read-modify-write instructions, which on modern CPUs need two address-generation μops in addition to the load one, the store one, and the ALU one. So the underlying load-store architecture is very visible.

There’s also the part where we’ve trained ourselves out of using the more CISCy parts of x86 like ENTER, BOUND, or even LOOP, because they’ve been slow for ages, and thus they stay slow.

replies(2): >>45899729 #>>45903282 #
7. p_l ◴[] No.45899729{4}[source]
Even many of the more complex instructions can often translate into surprisingly short sequences - all sorts of loop structures now get various kinds of optimizations, including instruction fusion, that probably would not be necessary if we hadn't stopped using higher-level LOOP constructs ;-)

But REP MOVS, for example, is now fused into the equivalent of SSE load-stores (16 bytes) or even AVX-512 load-stores (64 bytes).

And of course the equivalent of LEA via ModRM/SIB bytes is pretty much free, with it being AFAIK handled as a pipeline step.

8. taolson ◴[] No.45901247{3}[source]
>something AMD noted IIRC in the original K5 with its AMD29050-derived core

Just a small nitpick: I've seen the K5/29050 connection mentioned in a number of places, but the K5 was actually based upon an un-released superscalar 29K project called "Jaguar", not the 29050, which was a single-issue, in-order design.

9. JoshTriplett ◴[] No.45901535[source]
Modern CPUs still translate individual instructions to corresponding micro-ops, and do a bit of optimization with adjacent micro-ops. Transmeta converted whole regions of code at a time, and I think it tried to do higher-level optimizations.
10. pizlonator ◴[] No.45901534[source]
Folks like to say that, but that's not what's happening.

The key difference is: what is an instruction set? Is it a Turing-complete thing with branches, calls, etc? Or is it just data flow instructions (math, compares, loads and stores, etc)?

X86 CPUs handle branching in the frontend using speculation. They predict where the branch will go, issue data flow instructions from that branch destination, along with a special "verify that I branched to the right place" instruction, which is basically just the compare portion of the branch. ARM CPUs do the same thing. In both X86 and ARM CPUs, the data flow instructions that the CPU actually executes look different (are lower level, have more registers) than the original instruction set.

This means that there is no need to translate branch destinations. There's never a place in the CPU that has to take a branch destination (an integer address in virtual memory) in your X86 instruction stream and work out what the corresponding branch destination is in the lower-level data flow stream. This is because the data flow stream doesn't branch; it only speculates.

On the other hand, a DBT has to have a story for translating branch destinations, and it does have to target a full instruction set that does have branching.

That said, I don't know what the Transmeta CPUs did. Maybe they had a low-level instruction set that had all sorts of hacks to help the translation layer avoid the problems of branch destination translation.

replies(1): >>45903210 #
11. actionfromafar ◴[] No.45901599[source]
Did anyone try dynamic recompilation from x86 to x86? Like a JIT taking advantage of the fact that the target ISA is compatible with the source ISA.
replies(3): >>45901792 #>>45902549 #>>45905791 #
12. solarexplorer ◴[] No.45901792[source]
Yes, I think the conclusion was that it did improve performance on binaries that were not compiled with optimizations, but didn't generate enough gains on optimized binaries to offset the cost of re-compilation.

https://dl.acm.org/doi/10.1145/358438.349303

(this is not about x86 but PA-RISC, but the conclusions would likely be very similar...)

13. cpgxiii ◴[] No.45902003[source]
> However, if you add it onto a better CPU it’s a fine technique to bet on - case in point Apple’s move away from Intel onto homegrown CPUs.

I don't think Apple is a good example here. Arm was extremely well-established when Apple began its own phone/tablet CPU designs. By the time Macs began to transition, much of their developer ecosystem was already familiar.

Apple's CPUs are actually notably conservative when compared to the truly wild variety of Arm implementations; no special vector instructions (e.g. SVE), no online translation (e.g. Nvidia Denver), no crazy little/big/bigger core complexes.

replies(2): >>45902272 #>>45905624 #
14. btilly ◴[] No.45902119[source]
That's kind of the bet they made, but misses a key point.

Their fundamental idea was that by having simpler CPUs, they could iterate on Moore's law more quickly. And eventually they would win on performance. Not just on a few speculative edge cases, but overall. The dynamic compilation was needed to be able to run existing software on it.

The first iterations, of course, would be slower. And so their initial market, the one needed to fund those early generations, would be low-power use cases, because the complexity of a CISC chip made low power a weak point for Intel.

They ran into a number of problems.

The first is that the team building that dynamic compilation layer was more familiar with the demands of Linux than Windows, with the result that the compilation worked better for Linux than Windows.

The second problem was that the "simple iterates faster" also turns out to be true for ARM chips. And the most profitable segments of that low power market turned out to be willing to rewrite their software for that use case.

And the third problem is that Intel proved to be able to address their architectural shortcomings by throwing enough engineers at the problem to iterate faster.

If Transmeta had won its bet, they would have completely dominated. But they didn't.

It is worth noting that Apple pursued a somewhat similar idea with Rosetta. Both in changing to Intel, and later changing to ARM64. With the crucial difference that they also controlled the operating system. Meaning that instead of constantly dynamically compiling, they could rely on the operating system to decide what needs to be compiled, when, and call it correctly. And they also better understood what to optimize for.

replies(2): >>45903370 #>>45905665 #
15. buildbot ◴[] No.45902223[source]
They did! There were many Transmeta-powered thin clients, for example.
replies(1): >>45905277 #
16. almostgotcaught ◴[] No.45902272{3}[source]
> no special vector instructions (e.g. SVE)

Wut - SVE and SME are literally Apple designs (AMX) which have been "back ported".

replies(1): >>45902940 #
17. tracker1 ◴[] No.45902290[source]
Exactly... I think that if you look at the accelerator paths that Apple's chips have for x86 emulation combined with software it's pretty nifty. I do wish these were somewhat standardized/licensed/upstreamed so that other arm vendors could use them in a normalized way.
18. tgma ◴[] No.45902549[source]
Notably, VMware and the like in the pre-hardware-virtualization era did something like that to run x86 programs fast under virtualization instead of interpreting x86 through emulation.
19. cpgxiii ◴[] No.45902940{4}[source]
> Wut - SVE and SME are literally Apple designs (AMX) which have been "back ported".

Literally no Apple CPUs meaningfully support SVE or SVE2. Apple added what I would say is a relatively "conventional" matrix extension (AMX) of their own, and now implements SME and SME2, but those are not equivalent to SVE. (I call AMX "conventional" in the sense that a fixed-size grid of matrix compute elements is not a particularly new idea, versus variable-sized SIMD, which is still quite rare.) Really, the only arm64 design with "full fat" SVE support is Fujitsu's a64fx (512-bit vector size); everything else on the very short list of hardware supporting SVE is still stuck with 128-bit vectors.
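
For what it's worth, the "variable-sized SIMD" point is easiest to see in code: a vector-length-agnostic SVE loop never hardcodes the lane count, so the very same binary runs with 128-bit vectors on most SVE cores and 512-bit vectors on a64fx. A minimal sketch using the standard ACLE intrinsics (the function name and build flag are just illustrative):

    /* Vector-length-agnostic SVE loop: dst[i] = src[i] * k.
       The lane count comes from svcntw() at run time, so nothing here
       assumes a particular vector width. Build with an SVE-enabled
       toolchain, e.g. -march=armv8-a+sve. */
    #include <arm_sve.h>

    void scale(float *dst, const float *src, float k, long n)
    {
        for (long i = 0; i < n; i += svcntw()) {       /* svcntw(): 32-bit lanes per vector */
            svbool_t    pg = svwhilelt_b32(i, n);      /* predicate covers the loop tail */
            svfloat32_t v  = svld1_f32(pg, src + i);   /* predicated load */
            svst1_f32(pg, dst + i, svmul_n_f32_x(pg, v, k));
        }
    }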

20. monocasa ◴[] No.45903210{3}[source]
> That said, I don't know what the Transmeta CPUs did. Maybe they had a low-level instruction set that had all sorts of hacks to help the translation layer avoid the problems of branch destination translation.

Fixed guest branches just get turned into host branches and work like normal.

Indirect guest branches would get translated through a hardware jump address cache that was structured kind of like TLB tag lookups are.
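
Purely as a software illustration of that TLB-like tag lookup (the real Crusoe structure was hardware, and every name below is made up), a translator-side jump-address cache might look roughly like this:

    /* Hypothetical sketch: map guest (x86) branch targets to translated
       host code with a small direct-mapped, tag-checked table. */
    #include <stdint.h>

    #define JAC_ENTRIES 1024            /* power of two */

    struct jac_entry {
        uint64_t guest_pc;              /* tag: guest branch target */
        void    *host_code;             /* translated trace for that target */
    };

    static struct jac_entry jac[JAC_ENTRIES];

    /* On an indirect guest branch: hit -> jump into the trace,
       miss -> fall back to the interpreter/translator. */
    static void *jac_lookup(uint64_t guest_pc)
    {
        struct jac_entry *e = &jac[guest_pc & (JAC_ENTRIES - 1)];
        return (e->guest_pc == guest_pc) ? e->host_code : 0;
    }

    static void jac_insert(uint64_t guest_pc, void *host_code)
    {
        struct jac_entry *e = &jac[guest_pc & (JAC_ENTRIES - 1)];
        e->guest_pc = guest_pc;
        e->host_code = host_code;
    }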

replies(1): >>45903329 #
21. monocasa ◴[] No.45903282{4}[source]
There's levels of microcode.

It's not too uncommon for each pipeline stage or so to have its own uop format, as each stage computes what it was designed to and culls what later stages don't need.

Because of this it's not that weird to see a single RMW uop at, say, the initial decode and microcode layer that then gets cracked into the different uops for the different functional units later on.
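
A toy sketch of that two-level view (made-up structures, not any real design): the front end carries one fused read-modify-write uop, and a later stage cracks it into the uops the load, ALU, and store units actually consume.

    /* Illustrative only: crack "add [addr_reg], src" into load; add; store. */
    enum uop_kind { UOP_LOAD, UOP_ALU_ADD, UOP_STORE, UOP_FUSED_RMW_ADD };

    struct uop { enum uop_kind kind; int dst, src, addr_reg; };

    /* Returns how many uops were written to out[] (1 if nothing to crack). */
    static int crack(struct uop in, struct uop out[3])
    {
        if (in.kind != UOP_FUSED_RMW_ADD) { out[0] = in; return 1; }
        out[0] = (struct uop){ UOP_LOAD,    in.dst, 0,      in.addr_reg };  /* t = [addr] */
        out[1] = (struct uop){ UOP_ALU_ADD, in.dst, in.src, 0           };  /* t += src   */
        out[2] = (struct uop){ UOP_STORE,   0,      in.dst, in.addr_reg };  /* [addr] = t */
        return 3;
    }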

22. pizlonator ◴[] No.45903329{4}[source]
Thank you for sharing!

> Fixed guest branches just get turned into host branches and work like normal.

How does that work in case of self-modifying code, or skewed execution (where the same x86 instruction stream has two totally different interpretations based on what offset you start at)?

replies(1): >>45904427 #
23. hedgehog ◴[] No.45903370[source]
I don't know if the bet was even particularly wrong. If they had done a little better job on performance, capitalized on the pains of Netburst + AMD64 transition, and survived long enough to do integrated 3D graphics and native libraries for Javascript + media decoding it might have worked out fine. That alternate universe might have involved a merger with Imagination when the Kyro was doing poorly and the company had financial pain. We'll never know.
replies(1): >>45903754 #
24. btilly ◴[] No.45903754{3}[source]
I don't either. Even with their problems, they didn't miss by much.

One key factor against them, though, is that they were facing a company whose long-term CEO had written Only The Paranoid Survive. At that point he had moved from being the CEO to the chairman of the board. But Intel had paranoia about possible existential threats baked into its DNA.

There is no question that Intel recognized Transmeta as a potential existential threat, and aggressively went after the very low-power market that Transmeta was targeting. Intel quickly created SpeedStep, allowing power consumption to dynamically scale when not under peak demand. This improved battery life on laptops using the Pentium III, without sacrificing peak performance. They went on to produce low power chips like the Pentium M that did even better on power.

Granted, Intel never managed to match the low power that Transmeta had. But they managed to limit Transmeta enough to cut off their air supply - they couldn't generate the revenue needed to invest enough to iterate as quickly as they needed to. This isn't just a story of Transmeta stumbling. This is also a story of Intel recognizing and heading off a potential threat.

25. fajitaforce5 ◴[] No.45903852[source]
I was an intel cpu architect when transmeta started making claims. We were baffled by those claims. We were pushing the limit of our pipelines to get incremental gains and they were claiming to beat a dedicated arch on the fly! None of their claims made sense to ANYONE with a shred of cpu arch experience. I think your summary has rose colored lenses, or reflects the layman’s perspective.
replies(4): >>45904343 #>>45904657 #>>45905133 #>>45905527 #
26. gdiamos ◴[] No.45904343[source]
It was risky.

From my perspective it was more exciting to the programming systems and compiler community than to the computer architecture community.

27. monocasa ◴[] No.45904427{5}[source]
Skewed executions are just different traces. Basic blocks don't have a requirement that they don't partially overlap with other basic blocks. You want that anyway for optimization reasons, even without skewed execution.

Self-modifying code is handled with MMU traps on the writes and invalidation of the relevant traces. It is very much a slow path though. Ideally, heavy self-modifying code is able to stay in the interpreter and not thrash in and out of the compiler.
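
In user space the same trick looks roughly like this - what DBTs and JIT hosts commonly do on a stock OS (the Crusoe version lived below the OS; invalidate_traces_for_page is a hypothetical helper from the translator):

    /* Sketch: write-protect guest pages that have translations; on a write
       fault, throw away the affected traces and let the write proceed. */
    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    extern void invalidate_traces_for_page(uintptr_t page);   /* hypothetical */

    static void on_write_fault(int sig, siginfo_t *si, void *ctx)
    {
        uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(getpagesize() - 1);
        invalidate_traces_for_page(page);                      /* drop stale traces */
        mprotect((void *)page, getpagesize(),
                 PROT_READ | PROT_WRITE);                      /* re-run the faulting write */
        (void)sig; (void)ctx;
    }

    static void protect_translated_page(void *guest_page)
    {
        mprotect(guest_page, getpagesize(), PROT_READ);        /* writes now trap */
    }

    static void install_handler(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_write_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, 0);
    }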

replies(1): >>45905474 #
28. nostrademons ◴[] No.45904657[source]
I think this is a classic hill-climbing dilemma. If you start in the same place, and one org has worked very hard and spent a lot of money optimizing the system, they will probably come out on top. But if you start in a different place, reimagining the problem from first principles, you may or may not find yourself with a taller hill to climb. Decisions made very early on in your hill-climbing process lock you in to a path, and then the people tasked with optimizing the system later can't fight the organizational inertia to backtrack and pick a different path. But a new startup can.

It's worth noting that Google actually did succeed with a wildly different architecture a couple years later. They figured "Well, if CPU performance is hitting a wall - why use just one CPU? Why not put together thousands of commodity CPUs that individually are not that powerful, and then use software to distribute workloads across those CPUs?" And the obvious objection to that is "If we did that, it won't be compatible with all the products out there that depend upon x86 binary compatibility", and Google's response was the ultimate in hubris: "Well we'll just build new products then, ones that are bigger and better than the whole industry." Miraculously it worked, and made a multi-trillion-dollar company (multiple multi-trillion-dollar companies, if you now consider how AWS, Facebook, TSMC, and NVidia revenue depends upon the cloud).

Transmeta's mistake was that they didn't re-examine enough assumptions. They assumed they were building a CPU rather than an industry. If they'd backed up even farther they would've found that there actually was fertile territory there.

replies(1): >>45905542 #
29. empw ◴[] No.45905133[source]
Wasn't Intel trying to do something similar with Itanium, i.e. using software to translate code into VLIW instructions to exploit many parallel execution units? Only they wanted the C++ compiler to do it rather than a dynamic recompiler. At least some people at Intel thought that was a good idea.

I wonder if the x86 teams at Intel were similarly baffled by that.

30. giantrobot ◴[] No.45905277{3}[source]
And UMPCs. Sony made at least one of the PictureBooks with a Transmeta CPU and IIRC their U1 used it as well.
31. pizlonator ◴[] No.45905474{6}[source]
> Self modifying code is handled with MMU traps on the writes, and invalidation of the relevant traces. It is very much a slow path though. Ideally heavy self modfying code is able to stay in the interpreter though and not thrash in and out of the compiler.

This might end up having a bad time running JavaScript VM JITed code, which self-modifies a lot.

But all of that makes sense! Thanks!

replies(1): >>45905778 #
32. hinkley ◴[] No.45905527[source]
The Itanium felt like Intel trying the same bet - move the speculation and analysis logic into the compiler and off the CPU. Where it differed is that it tried to leave some internal implementation details of that decoding process exposed so the compiler could use them directly, in a way that Transmeta didn't manage.

I wonder how long before we try it again.

33. hinkley ◴[] No.45905542{3}[source]
> Well we'll just build new products then, ones that are bigger and better than the whole industry.

With blackjack, and hookers!

34. hinkley ◴[] No.45905590[source]
Would TSMC be further along today, or not, if Transmeta had been thought up five, ten years later? Would Transmeta be farther along for having encountered a more mature TSMC?

TSMC seems to have made a lot of bones on Arm and Apple’s time.

35. vlovich123 ◴[] No.45905624{3}[source]
I think you're focusing on the details and missing my broader point - the JIT translation technique only works to break out of instruction-set lock-in. It does not improve performance, so betting on it instead of superscalar designs is not wise.

Transmeta’s CPU was not performance competitive and thus had no path to success.

And as for Apple itself, they had built the first iPhone on top of ARM to begin with (partially because Intel didn’t see a market), so they were already familiar with ARM before they even started building ARM CPUs. But also, the developer-ecosystem familiarity is only partially relevant - even in compatibility mode the M1 ran faster than equivalent contemporary Intel chips. So the familiarity was only needed to unlock the full potential (most of which was done by Apple porting first-party software). But even if they had never switched on ARM support in the M1, the JIT technique (combined with a better CPU and a better unified memory architecture) would still have been fast enough to slightly outcompete Intel chips on performance and battery life - native software just made it no contest.

36. hinkley ◴[] No.45905665[source]
Intel was already built on the P6 (Pentium Pro and its descendants) at this point. Not as iterable as pure software, but decoding x86 instructions into whatever they wanted to do internally sped up a lot of things on its own.

Perhaps they would have been better off building the decode logic as programmable by making effectively a multicore machine where the translation code ran on its own processor with its own cache, instead of a pure JIT.

37. monocasa ◴[] No.45905778{7}[source]
Yeah, nesting JITs was kind of always an Achilles heel of this kind of architecture.

IIRC, they had a research project to look at shipping a custom JVM that compiled straight to their internal ISA to skip the impedance mismatch between two JITs. JITed JS (or really any extremely dynamic code that also asks for high perf) probably wasn't even on their radar given the era, with even the Smalltalk VM that HotSpot derived from (Strongtalk) being a strongly typed derivative of Smalltalk.

38. hinkley ◴[] No.45905791[source]
I believe it was HP who accidentally tried this while making an early equivalent of Rosetta to deal with a hardware change on their mainframes and minicomputers. They modified it to run same-to-same translations and they did get notable performance improvements by doing so.

I’m pretty sure this experiment happened before Transmeta existed, or when it was still forming. So it ended up being evidence that what they were doing might work. It also was evidence that Java wasn’t completely insane to exist.

39. andrewf ◴[] No.45906222[source]
At the time I recall https://dl.acm.org/doi/pdf/10.1145/301631.301683 being an oft-discussed data point - speeding up DEC Alpha code by recompiling it into different DEC Alpha code using runtime statistics.

This was commonly cited in forum debates about whether Java and C# could come close to the performance of compiled languages. ("JITs and GCs are fast enough, and runtime stats mean they can even be faster!" was a common refrain, but not actually as true in 1999 as it is in 2025)