Athlon 64: How AMD turned the tables on Intel

(dfarq.homeip.net)

331 points giuliomagnifico | 1 comments | 25 Sep 25 18:09 UTC


bigstrat2003 ◴[25 Sep 25 19:20 UTC] No.45377613[source]▶

I remember at the time thinking it was really silly for Intel to release a 64-bit processor that broke compatibility, and was very glad AMD kept it. Years later I learned about kernel writing, and I now get why Intel tried to break with the old - the compatibility hacks piled up on x86 are truly awful. But ultimately, customers don't care about that, they just want their stuff to run.

replies(5): >>45377925 #>>45379301 #>>45380247 #>>45385323 #>>45386390 #

Romario77 ◴[26 Sep 25 13:40 UTC] No.45386390[source]▶

>>45377613 #

It wasn't just the incompatibility; some of the design decisions made it very hard to write code that performs well on Itanium.

Intel made a bet on parallel processing and on compilers figuring out how to organize instructions instead of doing this in silicon. It proved to be very hard to do, so the supposedly next-gen processors turned out to be more expensive and slower than the last-gen Intel parts or the new AMD ones.

replies(1): >>45388162 #

1. cameldrv ◴[26 Sep 25 16:16 UTC] No.45388162[source]▶

>>45386390 #

Yeah, the biggest idea was essentially to do the scheduling of instructions upfront in the compiler instead of dynamically at runtime. By doing this, you can save a ton of die area that would otherwise go to control logic and put it into functional units doing math etc.

The problem, as far as I can tell as a layman, is that the compiler simply doesn't have enough information to do this job at compile time. The timing of the CPU is not deterministic in the real world because caches can miss unpredictably, even depending on what other processes are running at the same time on the computer. Branches can also go different ways depending on the data being processed. Branch predictors and prefetchers can adapt at runtime using the actual statistics of what's happening in that particular execution of the program. Better compilers can do profile-guided optimization, but the result is still going to be optimized for the particular situation the CPU was in during the profile run(s).

If you think of a program like an interpreter running a tight loop in an interpreted program, a good branch predictor and prefetcher are probably going to be able to predict fairly well, but a statically scheduled CPU is in trouble because at the compile time of the interpreter, the compiler has no idea what program the interpreter is going to be running.
