
468 points | speckx | 1 comment
1. numpad0 [No.45303936]
> goes round-robin style asking each node to perform its prompt processing, then token generation.

Yeah, this is a long-known and now widely understood issue with distributed LLM inference. It can be remedied by having all nodes split the computation for each layer, but then you run into the classic supercomputing problem: node interconnect latency and bandwidth become the bottleneck.
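
To make the trade-off concrete, here's a toy back-of-envelope model (my own sketch, not from the article; node count, layer count, and hidden size are assumed values): round-robin/pipeline style keeps per-token traffic tiny but only one node busy at a time, while a tensor-parallel split keeps every node busy but pushes far more activation traffic over the interconnect per token.

    # Toy comparison of cluster utilization and interconnect traffic per
    # generated token. All sizes are assumptions for illustration only.
    N_NODES   = 4
    N_LAYERS  = 32           # transformer blocks, assumed split evenly
    HIDDEN    = 4096         # activation width
    ACT_BYTES = HIDDEN * 2   # fp16 -> 2 bytes per element

    # Round-robin / pipeline: each node owns a contiguous slice of layers.
    # For a single token only one node computes at a time; the activation
    # is handed off once per stage boundary.
    pipeline_busy_fraction = 1 / N_NODES
    pipeline_bytes_per_tok = (N_NODES - 1) * ACT_BYTES

    # Tensor-parallel: every node holds a shard of every layer, so all
    # nodes compute in parallel, but each block needs roughly two
    # all-reduces (after attention and after the MLP); a ring all-reduce
    # moves about 2*(N-1)/N of the tensor per node.
    tensor_busy_fraction = 1.0
    tensor_bytes_per_tok = N_LAYERS * 2 * 2 * (N_NODES - 1) / N_NODES * ACT_BYTES

    print(f"pipeline: {pipeline_busy_fraction:.0%} utilization, "
          f"~{pipeline_bytes_per_tok/1e3:.1f} kB/token over the wire")
    print(f"tensor  : {tensor_busy_fraction:.0%} utilization, "
          f"~{tensor_bytes_per_tok/1e3:.1f} kB/token over the wire")

With these made-up numbers that's roughly 25 kB vs. 400 kB of traffic per token, and in the tensor-parallel case every one of those transfers sits on the critical path of a layer, so link latency hurts as much as bandwidth.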

It looks to me like many such interconnects present themselves as Ethernet cards. I wonder whether the same thing could be recreated over the M.2 slot, rather than using that slot for node-local storage, and cost-effectively so (i.e. cheaper than a bunch of 10GbE cards plus short DACs).
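
For a rough sense of the headroom, here's a raw bandwidth comparison (my own numbers from the PCIe and Ethernet specs, ignoring protocol overhead and latency, and saying nothing about whether a cheap M.2-to-M.2 PCIe link product actually exists):

    # Raw link bandwidth: an M.2 slot's PCIe lanes vs. a 10GbE NIC.
    GB = 1e9
    links = {
        "PCIe 3.0 x4 (typical M.2)": 4 * 0.985 * GB,   # 8 GT/s, 128b/130b encoding
        "PCIe 4.0 x4 (newer M.2)":   4 * 1.969 * GB,
        "10GbE":                     10e9 / 8,          # 1.25 GB/s line rate
    }
    for name, bps in links.items():
        print(f"{name:28s} ~{bps / GB:5.2f} GB/s")

So even a Gen3 x4 M.2 slot has roughly 3x the raw bandwidth of 10GbE, and potentially much lower latency if the nodes could talk PCIe-to-PCIe directly instead of going through an Ethernet stack; the open question is whether that can be done cheaply.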