Open WebUI, SillyTavern, and other frontends can access any OpenAI-compatible server, and on Nvidia cards you have a wealth of options that will run one of those servers for you: llama.cpp (or the Ollama wrapper), of course, but also the faster vLLM and SGLang inference engines. Buy one of these, slap SGLang or vLLM on it, and point your devices at your machine's local IP address.
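For reference, "point your devices at it" looks something like this in practice; a minimal sketch using the `openai` Python client, assuming a vLLM or llama.cpp server is listening on port 8000 (the IP, port, and model name are placeholders for whatever your setup actually uses):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# The IP, port, and model name below are placeholders -- substitute your own.
client = OpenAI(
    base_url="http://192.168.1.50:8000/v1",  # your machine's local IP + server port
    api_key="not-needed",                    # local servers usually ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # whatever model the server loaded
    messages=[{"role": "user", "content": "Hello from my living-room GPU box."}],
)
print(response.choices[0].message.content)
```

Open WebUI and SillyTavern do the same thing under the hood: you just paste that base URL into their connection settings.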
I'm mildly skeptical about performance here: they aren't saying what the memory bandwidth is, and that will have a major impact on tokens per second. If it's anywhere close to the 4090's, or even the M2 Ultra's, 128GB of Nvidia is a steal at $3k. Getting that much VRAM on anything non-Apple used to cost tens of thousands of dollars.
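To see why bandwidth is the number that matters, here's the usual back-of-envelope estimate: during decode you stream the entire set of weights once per generated token, so tokens per second tops out around bandwidth divided by model size. The figures below (1008 GB/s for the 4090, 800 GB/s for the M2 Ultra, ~40GB for a 70B at Q4) are my own ballpark numbers, not anything from the announcement:

```python
# Rough decode-speed ceiling: each generated token reads all the weights once,
# so tok/s is at most memory_bandwidth / model_size. Real throughput lands lower.
model_size_gb = 40  # ~70B parameters at Q4 (roughly 0.55-0.6 bytes per param)

candidates = [
    ("RTX 4090", 1008),        # spec-sheet bandwidth, GB/s
    ("M2 Ultra", 800),         # spec-sheet bandwidth, GB/s
    ("mystery box", 500),      # pure guess for the unannounced machine
]
for name, bandwidth_gb_s in candidates:
    print(f"{name}: ~{bandwidth_gb_s / model_size_gb:.0f} tok/s ceiling")
```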
(They're also mentioning running the large models at Q4, which will definitely hurt the model's intelligence vs FP8 or BF16. But most people running models on Macs run them at Q4, so I guess it's a valid comparison. You can at least run a 70B at FP8 on one of these, even with fairly large context, which I think will be the sweet spot.)
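As a sanity check on that "sweet spot" claim, here's the memory budget I have in mind; the KV-cache math assumes a Llama-3-70B-style architecture (80 layers, 8 KV heads with GQA, head dim 128), which is my assumption, not something from the announcement:

```python
# Does 70B at FP8 plus a long context fit in 128GB? Back-of-envelope check.
total_gb = 128
weights_gb = 70  # 70B params at FP8 = ~70GB

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
# Architecture numbers below assume a Llama-3-70B-style model with GQA.
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 1  # FP8 KV cache
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

context_tokens = 64_000
kv_gb = kv_bytes_per_token * context_tokens / 1e9
print(f"weights {weights_gb}GB + KV cache {kv_gb:.1f}GB "
      f"= {weights_gb + kv_gb:.1f}GB of {total_gb}GB")
# -> roughly 80GB used, leaving headroom for activations and the OS.
```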