331 points giuliomagnifico | 25 comments
ndiddy ◴[] No.45377533[source]
Fun fact: Bob Colwell (chief architect of the Pentium Pro through Pentium 4) recently revealed that the Pentium 4 had its own 64-bit extension to x86 that would have beaten AMD64 to market by several years, but management forced him to disable it because they were worried that it would cannibalize IA64 sales.

> Intel’s Pentium 4 had our own internal version of x86–64. But you could not use it: we were forced to “fuse it off”, meaning that even though the functionality was in there, it could not be exercised by a user. This was a marketing decision by Intel — they believed, probably rightly, that bringing out a new 64-bit feature in the x86 would be perceived as betting against their own native-64-bit Itanium, and might well severely damage Itanium’s chances. I was told, not once, but twice, that if I “didn’t stop yammering about the need to go 64-bits in x86 I’d be fired on the spot” and was directly ordered to take out that 64-bit stuff.

https://www.quora.com/How-was-AMD-able-to-beat-Intel-in-deli...

replies(11): >>45377674 #>>45377914 #>>45378427 #>>45378583 #>>45380663 #>>45382171 #>>45384182 #>>45385968 #>>45388594 #>>45389629 #>>45391228 #
1. kimixa ◴[] No.45380663[source]
That's no guarantee it would succeed though - AMD64 also cleaned up a number of warts on the x86 architecture, like more registers.

While I suspect the Intel equivalent would have done similar things - a break that big makes them the obvious things to do - there's no guarantee it wouldn't have been worse than AMD64. Though I suppose, in retrospect, it could also have been "better".

And also remember that at the time the Pentium 4 was very much struggling to reach its advertised performance. One could argue that one of the major reasons the AMD64 ISA took off is that the devices that first supported it were (generally) superior even in 32-bit mode.

EDIT: And I'm surprised it got as far as silicon. AMD64 was "announced" and the spec released before the Pentium 4 was even released, over 3 years before the first AMD implementations could be purchased. I guess Intel thought they didn't "need" to be public about it? And the AMD64 extensions cost a rather non-trivial amount of silicon and engineering effort to implement - did the plan for Itanium change late enough in the P4 design that it couldn't be removed? Or perhaps this all implies it was a much less far-reaching (and so less costly) design?

replies(5): >>45381174 #>>45381211 #>>45384598 #>>45385380 #>>45386422 #
2. chasil ◴[] No.45381174[source]
The times that I have used "gcc -S" on my code, I have never seen the additional registers used.

I understand that r8-r15 require a REX prefix, which is hostile to code density.

I've never done it with -O2. Maybe that would surprise me.
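
For anyone who wants to try it, a minimal sketch (the function name and flags here are just an illustration; compiler output will vary): with optimization enabled the extra registers tend to show up, because the System V ABI already passes the 5th and 6th integer arguments in r8 and r9, and the optimizer keeps values in registers instead of spilling them.

    /* sum8.c -- hypothetical example with enough integer values in flight
       that an optimizing x86-64 compiler is likely to touch r8-r15. */
    long sum8(long a, long b, long c, long d,
              long e, long f, long g, long h)
    {
        /* Under the System V ABI the first six arguments arrive in
           rdi, rsi, rdx, rcx, r8, r9; the last two come in on the stack. */
        return a + b + c + d + e + f + g + h;
    }
    /* Compare:
         gcc -O0 -S sum8.c          # everything bounces through the stack
         gcc -O2 -S sum8.c  (or -Os) # values stay in registers, incl. r8/r9 */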

replies(3): >>45381498 #>>45381833 #>>45387856 #
3. ghaff ◴[] No.45381211[source]
As someone who followed IA64/Itanium pretty closely, it's still not clear to me to what degree Intel (or at least groups within Intel) genuinely thought IA64 was the better approach, and to what degree they simply wanted to get out from under the existing cross-licensing deals with AMD and others. There were certainly also constraints imposed by existing partnerships, notably with Microsoft.
replies(2): >>45381402 #>>45382598 #
4. ajross ◴[] No.45381402[source]
Both are likely true. It's easy to wave it away in hindsight, but there was genuine energy and excitement about the architecture in its early days. And while the first chips were late and built on behind-the-cutting-edge processes, they were actually very performant (the FPU numbers were even world-beating -- parallel VLIW dispatch really helped there).

Lots of people loved Itanium and wanted to see it succeed. But surely the business folks had their own ideas too.

replies(3): >>45381455 #>>45383151 #>>45383639 #
5. kimixa ◴[] No.45381455{3}[source]
Yes - VLIW seems to lend itself to computation-heavy code; it's used to this day in many DSP architectures (and arguably GPU architectures, or at least it "influences" many of them).
6. astrange ◴[] No.45381498[source]
You should be able to see it. REX prefixes cost a lot less than register spills do.

If you mean literally `gcc -S`, -O0 is worse than unoptimized: it basically keeps everything in memory to make debugging easier. -Os is the one with readable, sensible asm.

replies(1): >>45381519 #
7. chasil ◴[] No.45381519{3}[source]
Thanks, I'll give it a try.
8. o11c ◴[] No.45381833[source]
Obviously it depends on how many live variables there are at any point. A lot of nasty loops have relatively few non-memory operands involved, especially without inlining (though even without inlining, the ability to control ABI-mandated spills better will help).

But it's guaranteed to use `r8` and `r9` for the 5th and 6th integer arguments of a function (an unpacked 128-bit struct counting as 2 arguments), or the 3rd and 4th arguments under the Microsoft ABI (not sure about unpacking there). And `r10` is used if you make a system call on Linux.
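
The r10 case is easy to see in a raw system call. A minimal sketch (assumes Linux/x86-64 and GCC/Clang inline asm; `raw_pread` is just an illustrative name): the kernel's calling convention puts the 4th argument in r10 rather than rcx, because the syscall instruction itself clobbers rcx.

    /* raw_pread.c -- sketch of a raw Linux x86-64 syscall showing r10. */
    #include <sys/syscall.h>   /* SYS_pread64 */
    #include <unistd.h>

    static long raw_pread(int fd, void *buf, size_t len, long off)
    {
        register long r10 __asm__("r10") = off;  /* 4th syscall arg -> r10 */
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)                        /* rax: number in, result out */
                          : "0"(SYS_pread64), "D"((long)fd), /* rdi */
                            "S"(buf), "d"(len), "r"(r10)     /* rsi, rdx, r10 */
                          : "rcx", "r11", "memory");         /* syscall clobbers rcx/r11 */
        return ret;
    }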

9. tw04 ◴[] No.45382598[source]
Given that Itanium originated at HP, it seems it was less about AMD and more about the fact that, at the time, Intel was struggling with 64-bit. People are talking about the P4, but the Itanium architecture dates back to the late 80s…

https://en.m.wikipedia.org/wiki/Itanium

replies(1): >>45390482 #
10. ccgreg ◴[] No.45383151{3}[source]
> they were actually very performant

Insanely expensive for that performance. I was the architect of HPC clusters in that era, and Itanic never made it to the top for price per performance.

Also, having lived through the software stack issues with the first beta chips of Itanic and AMD64 (and MIPS64, but who's counting), AMD64 was way way more stable than the others.

11. pjmlp ◴[] No.45383639{3}[source]
I am one of those people, and I think it only failed because AMD was able to turn the tables on Intel, to use the article's title.

Without AMD64, I firmly believe eventually Itanium would have been the new world no matter what.

We see this all the time: technology that could be great fails because it isn't pushed hard enough, while other, similar technology succeeds because its creators are willing to push it at a loss for several years until it finally becomes the new way.

replies(3): >>45387412 #>>45389086 #>>45403145 #
12. tuyiown ◴[] No.45384598[source]
> first supported it were (generally) superior even in 32-bit mode.

They were also affordable dual cores, which wasn't the norm at all at the time.

13. p_l ◴[] No.45385380[source]
At the time AMD64 was delivered, the Pentium 4 was widely speculated to be capable of running 64-bit, but at half the speed.

Essentially, while decoding of a 64-bit variant of the x86 ISA might have been fused off, there was a very visible part that was common anyway: the ALUs available on the NetBurst platform - which IIRC were 2x 32-bit ALUs for integer ops. So you either issue a micro-op to both to "chain" them together, or run every 64-bit calculation in multiple steps.
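
Roughly what that chaining looks like, expressed in C rather than hardware (a sketch of the idea, not of Intel's actual implementation): a 64-bit add becomes two 32-bit adds, the second one consuming the carry out of the low half.

    #include <stdint.h>

    /* Illustration only: a 64-bit add built from two 32-bit operations,
       mirroring the "issue to both halves / chain them" idea. */
    static uint64_t add64_via_32(uint64_t x, uint64_t y)
    {
        uint32_t lo    = (uint32_t)x + (uint32_t)y;
        uint32_t carry = lo < (uint32_t)x;          /* carry out of the low half */
        uint32_t hi    = (uint32_t)(x >> 32) + (uint32_t)(y >> 32) + carry;
        return ((uint64_t)hi << 32) | lo;
    }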

replies(1): >>45388690 #
14. kouteiheika ◴[] No.45386422[source]
> That's no guarantee it would succeed though - AMD64 also cleaned up a number of warts on the x86 architecture, like more registers.

As someone who works with AMD64 assembly very often - they didn't really clean it up all that much. Instruction encoding is still horrible, you still have a bunch of useless instructions even in 64-bit mode which waste valuable encoding space, you still have a bunch of instructions which hardcode registers for no good reason (e.g. the shift instructions have a hardcoded rcx). The list goes on. They pretty much did almost the minimal amount of work to make it 64-bit, but didn't actually go very far when it comes to making it a clean 64-bit ISA.

I'd love to see what Intel came up with, but I'd be surprised if they did a worse job.
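
The hardcoded-rcx wart is easy to reproduce (hypothetical snippet; exact output depends on the compiler): any shift by a variable amount has to route the count through cl, so the compiler shuffles registers around purely to satisfy the encoding. (BMI2's shlx/shrx lift this restriction on newer CPUs.)

    #include <stdint.h>

    /* The legacy SHL/SHR/SAR reg, cl encodings hardwire the count register. */
    uint64_t var_shift(uint64_t value, unsigned count)
    {
        return value << (count & 63);
    }
    /* gcc -O2 -S typically emits something along the lines of (Intel syntax):
         mov  ecx, esi     ; count has to move into cl's register
         mov  rax, rdi
         shl  rax, cl                                                        */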

15. ghaff ◴[] No.45387412{4}[source]
I'm inclined to agree, and I've written as much. In a world where 64-bit x86 wasn't really an option, Intel and "the industry" would probably have eventually figured out how to make Itanium work well enough and cost-effectively enough, and iterated on it over time. Some of the then-current RISC chips would probably have remained more broadly viable in that timeline, but in the absence of a viable alternative, 64-bit was going to happen, and therefore probably Itanium.

Maybe ARM gets a real kick in the pants but high-performance server processors were probably too far in the future to play a meaningful role.

16. wat10000 ◴[] No.45387856[source]
I don't have gcc handy, but this bit of code pretty easily gets clang to use several of them:

    #include <stdio.h>
    int f(int **x) {
        int *a = x[0]; int *b = x[1]; int *c = x[2]; int *d = x[3];
        puts("hello");
        return *a + *b + *c + *d;
    }
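
(Compiled with something like `clang -O2 -S`, the four pointers have to survive the `puts` call, so they typically land in callee-saved registers such as rbx and r12-r14; the exact allocation is of course up to the compiler.)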
17. eigenform ◴[] No.45388690[source]
Yeah, they wrote a paper about the ALUs too, see:

https://ctho.org/toread/forclass/18-722/logicfamilies/Delega...

> There are two distinct 32-bit FCLK execution data paths staggered by one clock to implement 64-bit operations.

If it weren't fused off, they probably would've supported 64-bit ops with an additional cycle of latency?

replies(1): >>45389997 #
18. Agingcoder ◴[] No.45389086{4}[source]
If I remember well, there was a fundamental difficulty with the 'given a sufficiently smart compiler' premise, revolving around automatic parallelization. You might argue that given enough time and money it might have been solved, but it's a really hard problem.

( I might have forgotten)
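
The textbook illustration of why that's hard (a sketch of the problem, nothing Itanium-specific): in pointer-chasing code the latency of each load depends on where the data happens to sit in the memory hierarchy, which an ahead-of-time scheduler cannot know, while an out-of-order core discovers it at runtime.

    /* Every iteration's load latency depends on whether n->next hits
       L1, L2, L3 or DRAM -- unknowable when the compiler fixes the
       static schedule. */
    struct node { struct node *next; long value; };

    long list_sum(const struct node *n)
    {
        long sum = 0;
        while (n) {
            sum += n->value;   /* depends on the load of n...        */
            n = n->next;       /* ...whose latency is data-dependent */
        }
        return sum;
    }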

replies(1): >>45389534 #
19. ajross ◴[] No.45389534{5}[source]
The compilers did arrive, but obviously too late. Modern pipeline optimization and register scheduling in gcc & LLVM is wildly more sophisticated than anything people were imagining in 2001.
replies(1): >>45392989 #
20. p_l ◴[] No.45389997{3}[source]
At least one cycle, yes, but generally it would have made it possible to deliver. AFAIK it also became a crucial part of how Intel could deliver "EM64T" chips fast enough - only for them to forget to upgrade the memory subsystem, which is why the first generation can't run Windows (they retained the 36-bit physical addressing from PAE where AMD64 mandates a minimum of 40 bits, and Windows managed to trigger an issue with that).
21. mwpmaybe ◴[] No.45390482{3}[source]
For context, it was intended to be the successor to PA-RISC and compete with DEC Alpha.
22. kimixa ◴[] No.45392989{6}[source]
But modern CPUs have even more capability for re-ordering/OOO execution and other "live" scheduling work. They will always have more information available than an ahead-of-time static schedule from the compiler, as so much is data dependent. If it weren't worth it, they would be slashing those capabilities instead.

Statically scheduled/in-order designs are still relegated pretty much to microcontrollers or specific numeric workloads. For general computation, it still seems like a poor fit.

replies(1): >>45403960 #
23. thesz ◴[] No.45403145{4}[source]
> Without AMD64, I firmly believe eventually Itanium would have been the new world no matter what.

VLIW is not binary forward- or cross-implementation-compatible. If MODEL1 has 2 instructions per block and its successor MODEL2 has 4, code for MODEL1 will run on MODEL2, but it will underperform due to underutilization. If execution latencies differ between two implementations of the same VLIW ISA, code tuned for one may not execute optimally on the other. Even different memory controllers and cache hierarchies can change what the optimal VLIW code is.

This precludes any VLIW from having multiple differently constrained implementations. You cannot segment VLIW implementations the way you can with x86, ARM, MIPS, PowerPC, etc., where the same code will be executed as optimally as possible on each concrete implementation of the ISA.

So - no, Itanium (or any other VLIW for that matter) would not be the new world.

replies(1): >>45403947 #
24. ajross ◴[] No.45403947{5}[source]
> VLIW is not binary forward- or cross-implementation-compatible.

It was on IA-64; the bundle format was deliberately chosen to allow for easy extension.

But broadly it's true: you can't have a "pure" VLIW architecture independent of the issue and pipeline architecture of the CPU. Any device with a differing runtime architecture is going to have to do some cooking of the instructions to match them to its own backend. But that decode engine is much easier to write when it starts from a wide format that presents lots of instructions and makes explicit promises about their interdependencies.

25. ajross ◴[] No.45403960{7}[source]
That's true. But if anything that cuts in the opposite direction of the argument: modern CPUs are doing all of that optimization in hardware, at runtime. Doing it in software, ahead of time, is a no-brainer in comparison.