I started programming on an 8 MHz Mac Plus in the late 1980s and got a bachelor's degree in computer engineering in the late 1990s. From my perspective, a kind of inverse Moore's Law played out, where single-threaded performance stayed approximately constant while the number of transistors doubled every 18 months.
Wondering why that happened is a bit like asking how high the national debt would have to get before we tax rich people, or how many millions of people have to die in a holocaust before the world's economic superpowers stop it. In other words, it just did.
But I think we've reached such an astounding number of transistors per chip (100 billion or more) that we finally have a chance to try alternative approaches that are competitive. So few of those transistors are doing useful work per instruction that it wouldn't take much to beat status-quo performance. Note that I'm talking about multicore desktop computing here, not GPUs (their SIMD performance actually has increased).
I had hoped that FPGAs would allow us to do this, but their evolution seems to have been halted by the powers that be. I also have some ideas for MIMD on SIMD, which is the only other way I can see this happening. I think that if the author can reach the CMOS compatibility they spoke of, if home lithography could be provided by an open-source device the way it happened for 3D printing, and if we could get above 1 million transistors running over 100 MHz, then we could play around with cores having the performance of a MIPS, PowerPC, or Pentium.
In the meantime, it might be fun to prototype with AI and build a transputer at home with local memories. It looks like the $1 Raspberry Pi RP2040 (266 MIPS, 2 cores, 32-bit, 264 kB on-chip RAM) could be a contender: it has about 5 times the MIPS of an early 32-bit PowerPC or Pentium processor.
For comparison, the early Intel i7-920 had 12,000 MIPS (at 64 bits), so the RP2040 is about 50 times slower (not too shabby for a $1 chip). But where the i7 had 731 million transistors, the RP2040 has only 134,000 (not a typo). Roughly 50 times the performance from over 5,000 times as many transistors means the i7 delivers only about 1% (50/5,455) of the per-transistor performance it should.
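Here's that back-of-the-envelope math spelled out, using the figures quoted above as given (I haven't independently verified the MIPS or transistor counts):

    # Back-of-the-envelope comparison using the figures quoted above.
    rp2040_mips = 266
    rp2040_transistors = 134_000

    i7_920_mips = 12_000
    i7_920_transistors = 731_000_000

    speed_ratio = i7_920_mips / rp2040_mips                      # ~45x faster
    transistor_ratio = i7_920_transistors / rp2040_transistors   # ~5,455x more transistors

    # Per-transistor performance of the i7 relative to the RP2040.
    relative_efficiency = speed_ratio / transistor_ratio         # ~0.008, i.e. about 1%

    print(f"i7-920 is {speed_ratio:.0f}x faster using {transistor_ratio:.0f}x the transistors")
    print(f"Per transistor, the i7 delivers ~{relative_efficiency:.1%} of the RP2040's performance")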
I'm picturing an array of at least 256 of these low-cost cores and designing an infinite-thread programming language that auto-parallelizes code without having to manually use intrinsics. Then we could really start exploring things like genetic algorithms, large agent simulations, and even artificial life without hand-porting our code to whatever non-symmetric multiprocessing runtime we're currently forced to use.
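To make the idea a bit more concrete, here's a rough sketch in Python (not the language I'm imagining, just a stand-in): the programmer writes plain per-agent code and a runtime fans it out across however many cores exist, with no intrinsics in sight. The update_agent function and the pool-based "runtime" are purely illustrative.

    # Illustrative only: sequential per-agent code, fanned out across all
    # available cores by a generic pool rather than hand-written intrinsics.
    from concurrent.futures import ProcessPoolExecutor
    import random

    def update_agent(agent):
        # Single-agent logic -- no parallelism visible to the programmer here.
        x, y = agent
        return (x + random.choice((-1, 0, 1)), y + random.choice((-1, 0, 1)))

    def step(agents):
        # The "infinite-thread" ideal: conceptually one thread per agent,
        # mapped onto whatever cores the hardware actually has.
        with ProcessPoolExecutor() as pool:
            return list(pool.map(update_agent, agents, chunksize=256))

    if __name__ == "__main__":
        agents = [(0, 0)] * 10_000
        for _ in range(10):
            agents = step(agents)
        print(agents[:5])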