I'm waiting for the other shoe to drop, when someone comes out with an FPGA optimized for reconfigurable computing and lowers the cost of LLM compute by 90% or better.
Raw GEMM computation was never the real bottleneck. Feeding the matmuls, i.e. memory bandwidth, is where it's at, especially on the newer GPUs.
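
A quick roofline sketch makes the point. The peak numbers below are rough H100-class assumptions I'm plugging in for order of magnitude, not spec-sheet values:

    # Back-of-the-envelope roofline check: is batch-1 LLM decode
    # compute-bound or bandwidth-bound?

    peak_flops = 989e12   # assumed ~H100-class dense BF16 FLOP/s
    peak_bw    = 3.35e12  # assumed ~H100-class HBM bandwidth, bytes/s
    ridge = peak_flops / peak_bw  # FLOPs/byte needed to be compute-bound

    # Batch-1 decode: each token does a (d x d) weight matrix times a
    # d-vector. FLOPs = 2*d*d (multiply-add = 2); bytes moved is at
    # least 2*d*d just to stream the BF16 weights once.
    d = 8192
    flops = 2 * d * d
    bytes_moved = 2 * d * d
    intensity = flops / bytes_moved  # = 1 FLOP/byte

    print(f"ridge point:    {ridge:.0f} FLOPs/byte")   # ~295
    print(f"GEMV intensity: {intensity:.0f} FLOP/byte")  # ~1

The GEMV sits a couple hundred times below the ridge point, so the ALUs are mostly idle waiting on HBM. Swapping in cheaper math units doesn't move that needle; cheaper bandwidth would.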