Garry Tan was right[1] that there is a fundamental inefficiency inherent in the von Neumann architecture we're all using. This gross impedance mismatch[4] is a great opportunity for innovation.
Once ENIAC was "improved" from its original structure into a general-purpose computing device in the von Neumann style, it suffered an 83% loss in performance[2]. Everything since is 80 years of premature optimization that we need to unwind. It's the ultimate pile of technical debt.
Instead of throwing maximum effort into making specific workloads faster, why not build a chip that makes all workloads faster, and let economies of scale work for everyone?
I propose (and have for a while[3]) a general purpose solution.
A systolic array of simple Look Up Tables (LUTs), each with 4 bits in and 4 bits out, and latched so that timing issues are eliminated, could greatly accelerate computation, and in a far nearer timeframe.
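To make the idea concrete, here is a rough software sketch of such an array. The details are my assumptions for illustration, not a spec: each cell takes one input bit from each of its four neighbours and sends one output bit to each, edges read as 0, and every cell latches on a single global clock. The names `make_lut`, `step`, and `PASS_EAST` are made up for this example.

```python
from typing import Callable, List

# Bit positions inside a 4-bit value: one bit per neighbour direction.
# (Assumed wiring: one input from and one output to each neighbour.)
N, E, S, W = 0, 1, 2, 3

Lut = List[int]         # 16 entries, each a 4-bit output value
Grid = List[List[int]]  # the latched 4-bit output of every cell


def make_lut(fn: Callable[[int], int]) -> Lut:
    """Build a 16-entry table from a function of the 4 input bits."""
    return [fn(addr) & 0xF for addr in range(16)]


def step(luts: List[List[Lut]], latched: Grid) -> Grid:
    """Advance the whole array by one clock tick.

    Every cell reads only the values latched on the previous tick,
    so evaluation order inside a tick does not matter.
    """
    rows, cols = len(luts), len(luts[0])

    def sent(r: int, c: int, direction: int) -> int:
        # Bit that cell (r, c) is sending in `direction`; 0 outside the grid.
        if 0 <= r < rows and 0 <= c < cols:
            return (latched[r][c] >> direction) & 1
        return 0

    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # My north input is what the cell above sends south, and so on.
            addr = (
                (sent(r - 1, c, S) << N)
                | (sent(r, c + 1, W) << E)
                | (sent(r + 1, c, N) << S)
                | (sent(r, c - 1, E) << W)
            )
            nxt[r][c] = luts[r][c][addr]
    return nxt


# A cell programmed as a wire: copy the west input to the east output.
PASS_EAST = make_lut(lambda a: ((a >> W) & 1) << E)

# A 1x4 row of wires; the first cell starts with its east output set.
luts = [[PASS_EAST] * 4]
state = [[1 << E, 0, 0, 0]]
state = step(luts, state)  # the pulse has moved one cell to the east
```

Swapping `PASS_EAST` for a different 16-entry table turns a cell into any other function of its four inputs, which is the whole programming model in this sketch.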
The challenges are that it's a greenfield environment, with no compilers (though it's probable that LLVM could target it), and a bus number of 1.
[1] https://www.ycombinator.com/rfs-build#llms-for-chip-design
[2] https://en.wikipedia.org/wiki/ENIAC#Improvements
[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...