einrealist:
The major AI gatekeepers, with their powerful models, are already experiencing capacity and scale issues. This won't change unless the underlying technology (LLMs) undergoes a fundamental shift. As more and more things become AI-enabled, how dependent will we be on these gatekeepers and their computing capacity? And how much will they charge us for prioritised access to these resources? And we haven't really gotten to the wearable devices stage yet.

Also, everyone who requires these sophisticated models now needs to send everything to the gatekeepers. You could argue that we already send a lot of data to public clouds. However, there was no economically viable way for cloud vendors to read, interpret, and reuse my data (my intellectual property and private information). With more and more companies forcing AI capabilities on us, it's often unclear who runs those models, who receives the data, and what actually happens to it.

This aggregation of power and centralisation of data worries me as much as the shortcomings of LLMs. The technology is still not accurate enough. But we want it to be accurate because we are lazy. So I fear that we will end up with many things of diminished quality in favour of cheaper operating costs — time will tell.

kgeist:
We've been running our own LLM server at the office for a month now, as an experiment (for privacy/infosec reasons), and a single RTX 5090 is enough to serve 50 people for occasional use. We run Qwen3 32B, which in some benchmarks is comparable to GPT-4.1 mini or Gemini 2.5 Flash. The GPU handles 2 concurrent requests with 32k context each at 60 tok/s. At first I was skeptical that a single GPU would be enough, but it turns out most people don't use LLMs 24/7.
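
For anyone wondering what "serving the office" looks like from the client side, here is a minimal sketch, assuming the server exposes an OpenAI-compatible /v1/chat/completions endpoint (which llama.cpp's llama-server does, as the commenter confirms later in the thread). The host, port and model name below are placeholders, not details from the commenter's setup.

    import requests

    # Placeholder address of the self-hosted, OpenAI-compatible server on the office LAN.
    BASE_URL = "http://llm.office.local:8080/v1"

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": "qwen3-32b",  # whatever model name the server reports
            "messages": [{"role": "user", "content": "Summarize this ticket: ..."}],
            "max_tokens": 512,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])

Any OpenAI-style client library pointed at the same base URL works just as well; the point is that requests never leave the local network.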
pu_pe:
That's really great performance! Could you share more details about the implementation (e.g. which quantized version of the model, how much RAM, etc.)?
kgeist:
Model: Qwen3 32B

GPU: RTX 5090 (no ROPs missing), 32 GB VRAM

Quants: Unsloth Dynamic 2.0, 4-6 bits depending on the layer.

RAM: 96 GB. More RAM makes a difference even if the model fits entirely in the GPU: the filesystem pages containing the model on disk are cached in RAM, so when you switch models (we use other models as well) the unload/load overhead is only 3-5 seconds.

The KV (key-value) cache is also quantized to 8 bits (anything less degrades quality considerably).

This gives you 1 generation with 64k context, or 2 concurrent generations with 32k each. Everything takes about 30 GB of VRAM, which also leaves some room for a Whisper speech-to-text model (turbo & quantized) running in parallel.
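
As a rough sanity check on that 30 GB figure, here is a back-of-the-envelope sketch. The architecture numbers (layer count, KV heads, head dimension) are illustrative assumptions for a 32B-class model, not values taken from the actual Qwen3 config, and the average bits-per-weight is a guess at the dynamic quant's mix.

    # Illustrative VRAM budget; every number here is an assumption, not read
    # from the real model config.
    params = 32e9              # ~32B parameters
    bits_per_weight = 4.5      # dynamic 4-6 bit quant, rough average
    weights_gb = params * bits_per_weight / 8 / 1e9        # ~18 GB

    n_layers, n_kv_heads, head_dim = 64, 8, 128             # assumed architecture
    ctx_tokens = 65536         # one 64k slot (or two 32k slots, same total)
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 1  # K+V at 8-bit
    kv_gb = kv_bytes_per_token * ctx_tokens / 1e9            # ~8.6 GB

    print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_gb:.1f} GB "
          f"= ~{weights_gb + kv_gb:.0f} GB before runtime overhead")

That lands in the high 20s of GB, roughly consistent with the ~30 GB reported once compute buffers and the Whisper model are added.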

oceansweep:
Are you doing this with vLLM? If you're using llama.cpp/Ollama instead, you could likely see some pretty massive improvements by switching.
kgeist:
We're using llama.cpp. We use all kinds of models other than Qwen3, and vLLM's startup when switching models is prohibitively slow (several times slower than llama.cpp, which already takes about 5 seconds).

From what I understand, vLLM is best when there's only 1 active model pinned to the GPU and you have many concurrent users (4, 8, etc.). But with just a single 32 GB GPU you have to switch models pretty often, and you can't fit more than 2 concurrent users anyway (without sacrificing context length considerably: 4 users = just 16k context each, 8 users = 8k), so I don't think vLLM is worth it for us so far. Once we have several cards, we may switch to vLLM.
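
For reference, a launch along these lines is one plausible way to reproduce the setup described in this thread. The flags are from recent llama.cpp llama-server builds (check llama-server --help for your version, since flag names and defaults change between releases), and the model path and port are placeholders. The key detail is that -c sets the total context, which the server divides evenly across the -np parallel slots; that is where the 2x32k / 4x16k / 8x8k trade-off comes from.

    import subprocess

    # Hypothetical paths and port; the flag set mirrors the setup described above.
    subprocess.run([
        "llama-server",
        "-m", "/models/Qwen3-32B-UD-Q4_K_XL.gguf",  # placeholder Unsloth dynamic-quant file
        "-ngl", "99",              # offload all layers to the GPU
        "-c", "65536",             # total context, split across parallel slots
        "-np", "2",                # 2 slots -> 32k context each
        "-fa",                     # flash attention (needed for the quantized V cache)
        "--cache-type-k", "q8_0",  # 8-bit key cache
        "--cache-type-v", "q8_0",  # 8-bit value cache
        "--host", "0.0.0.0",
        "--port", "8080",
    ], check=True)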