I watched the video and enjoyed it. The most interesting part for me was running distributed Llama.cpp: Jeff mentioned it seems to work in a linear fashion, with processing hopping between nodes.
Which got me thinking: how do these frontier AI models work when you (as a user) run a query? Does your query just go to one big box with lots of GPUs attached, running in a similar way but much faster? Do these AI companies write about how their infra works?
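For concreteness, here's a toy sketch of what I understood that "linear hop" to mean: pipeline parallelism, where each node holds a contiguous slice of the model's layers and the activation travels node to node. This isn't Llama.cpp's actual RPC code; the Node class, sizes, and tanh "layer" are all made up for illustration.

```python
# Toy sketch of pipeline-parallel inference: each node owns a slice
# of the layers; the activation hops through them in order.
import numpy as np

HIDDEN = 8    # toy hidden size; real models use thousands
LAYERS = 12   # toy layer count
NODES = 3     # simulated machines

rng = np.random.default_rng(0)

class Node:
    """Holds one contiguous slice of the model's layers."""
    def __init__(self, layer_weights):
        self.layer_weights = layer_weights

    def forward(self, activation):
        # Stand-in for a real transformer layer: one matmul + nonlinearity.
        for w in self.layer_weights:
            activation = np.tanh(activation @ w)
        return activation

# Split the layers evenly across nodes, preserving order.
weights = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(LAYERS)]
per_node = LAYERS // NODES
nodes = [Node(weights[i * per_node:(i + 1) * per_node]) for i in range(NODES)]

# One token's forward pass: the activation hops node to node, so only
# one node is busy at a time unless multiple requests are pipelined.
activation = rng.normal(size=HIDDEN)
for i, node in enumerate(nodes):
    activation = node.forward(activation)  # in reality: a network hop
    print(f"node {i} done, activation norm {np.linalg.norm(activation):.3f}")
```

That sequential hop is also why the distributed setup in the video looked linear rather than all nodes working at once.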
replies(1):