
468 points | speckx | 1 comment
1. numpad0 [No.45303936]
> goes round-robin style asking each node to perform its prompt processing, then token generation.

Yeah, this is a long-known and now widely understood issue with distributed LLM inference. It can be remedied by having all nodes split the computation for each layer, but then you run into the classic supercomputing problem: node interconnect latency and bandwidth become the bottleneck.
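
To make the trade-off concrete, here's a toy back-of-envelope model (my own sketch, not from the article; node count, layer count, and hidden size are assumed values): round-robin/pipeline style keeps per-token traffic tiny but only one node busy at a time, while a tensor-parallel split keeps every node busy but pushes far more activation traffic over the interconnect per token.

    # Toy comparison of cluster utilization and interconnect traffic per
    # generated token. All sizes are assumptions for illustration only.
    N_NODES   = 4
    N_LAYERS  = 32           # transformer blocks, assumed split evenly
    HIDDEN    = 4096         # activation width
    ACT_BYTES = HIDDEN * 2   # fp16 -> 2 bytes per element

    # Round-robin / pipeline: each node owns a contiguous slice of layers.
    # For a single token only one node computes at a time; the activation
    # is handed off once per stage boundary.
    pipeline_busy_fraction = 1 / N_NODES
    pipeline_bytes_per_tok = (N_NODES - 1) * ACT_BYTES

    # Tensor-parallel: every node holds a shard of every layer, so all
    # nodes compute in parallel, but each block needs roughly two
    # all-reduces (after attention and after the MLP); a ring all-reduce
    # moves about 2*(N-1)/N of the tensor per node.
    tensor_busy_fraction = 1.0
    tensor_bytes_per_tok = N_LAYERS * 2 * 2 * (N_NODES - 1) / N_NODES * ACT_BYTES

    print(f"pipeline: {pipeline_busy_fraction:.0%} utilization, "
          f"~{pipeline_bytes_per_tok/1e3:.1f} kB/token over the wire")
    print(f"tensor  : {tensor_busy_fraction:.0%} utilization, "
          f"~{tensor_bytes_per_tok/1e3:.1f} kB/token over the wire")

With these made-up numbers that's roughly 25 kB vs. 400 kB of traffic per token, and in the tensor-parallel case every one of those transfers sits on the critical path of a layer, so link latency hurts as much as bandwidth.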

It looks to me like many such interconnects present themselves as Ethernet cards. I wonder whether the same thing could be recreated over the M.2 slot, rather than using that slot for node-local storage, and cost-effectively so (i.e. cheaper than a bunch of 10GbE cards plus short DACs).
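
For a rough sense of the headroom, here's a raw bandwidth comparison (my own numbers from the PCIe and Ethernet specs, ignoring protocol overhead and latency, and saying nothing about whether a cheap M.2-to-M.2 PCIe link product actually exists):

    # Raw link bandwidth: an M.2 slot's PCIe lanes vs. a 10GbE NIC.
    GB = 1e9
    links = {
        "PCIe 3.0 x4 (typical M.2)": 4 * 0.985 * GB,   # 8 GT/s, 128b/130b encoding
        "PCIe 4.0 x4 (newer M.2)":   4 * 1.969 * GB,
        "10GbE":                     10e9 / 8,          # 1.25 GB/s line rate
    }
    for name, bps in links.items():
        print(f"{name:28s} ~{bps / GB:5.2f} GB/s")

So even a Gen3 x4 M.2 slot has roughly 3x the raw bandwidth of 10GbE, and potentially much lower latency if the nodes could talk PCIe-to-PCIe directly instead of going through an Ethernet stack; the open question is whether that can be done cheaply.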